FOREWORD


In my opinion, this Bulletin by Dr. Jackson and Dr. Ferguson 
gives the most penetrating analysis of the problem of reliability yet 
made. Reliability, they prove conclusively, is not the simple concept
it was once considered. There is no such thing as the reliability of a 
test, but only the reliability of a test in a specified situation. The 
rather technical experimental analyses of the earlier chapters lead to
the practical recommendations of Chapter VII. 


In these recommendations regarding the reporting of data relating 
to the reliability of tests, attention is drawn to the necessity (a) of
estimating both the absolute and the relative accuracy of the test; (b)
of drawing a distinction between reliability as usually understood and
the internal consistency of the test; (c) of choosing the best method of
analysing the experimental data of the test (the authors recommend
the analysis of variance and covariance method in preference to a
correlation technique); and (d) of a combinatorial reliability analysis for
tests made up of a battery of sub-tests.

We must confess that in our practices in the Department of 
Educational Research we have fallen short of these high ideals set by 
the authors. We shall not wear a hair shirt or sprinkle the head with 
ashes for we lived up to the lights we had. Now that further illumi- 
nation has been given, we shall try to live up to these higher and 
brighter standards. 


PETER SANDIFORD 


University of Toronto 
October 1941. 
CHAPTER I


REVIEW OF LITERATURE ON TEST RELIABILITY 


The term reliability was first introduced into mental test theory 
by Spearman in 1904. Since that time the literature dealing with test 
reliability has grown to comprehensive dimensions. The initial im- 
petus given to the subject by the papers of Spearman [105, 106], Brown 
[7], and Kelley [57] was followed by a large number of empirical enquir- 
ies which contributed little or nothing either in the way of clarification 
or development to the fundamental reliability concepts. Recent pub- 
lications, however, have reflected a tendency towards more rigorous 
examination of the fundamental assumptions involved in reliability 
theory, and some significant progress has been made in the application 
of analysis of variance methods to reliability problems. 

No attempt has been made in the preparation of the present brief 
review of literature to include all articles that relate in one way or 
another to the subject-matter of test reliability. Such a task would be 
not only laborious but unprofitable. We have, however, included all 
studies which in our opinion have made significant contributions to 
the subject. 


The Spearman-Brown Prophecy Formula 

The formula commonly used for estimating increase in reliability 
with increase in the length of test was derived independently by 
Spearman and Brown, and published by them simultaneously in the 
British Journal of Psychology, October, 1910. Since that time a large 
number of empirical enquiries have been carried out to determine the 
applicability of this formula. The earliest of these studies was that 
reported by Holzinger [45]. He administered forms A and B of the 
Terman Group Test of Mental Ability, a test consisting of ten com- 
ponent parts, to 135 pupils, and calculated a reliability coefficient for 
each component by correlating the parallel components of the two
forms. The average reliability of the ten components was used in the 
Spearman-Brown formula to determine a series of theoretical values 
which were compared with the obtained reliabilities of the cumulated 
components. These data seemed to indicate that the Spearman- 
Brown formula tended to over-predict the reliability of a test. This 
conclusion in Holzinger's investigation is in part invalidated due to
the lack of equivalence of the component parts of the test used. Con- 
sequently the assumptions on which the formula is based are obviously 
not satisfied. 

Holzinger and Clayton [46] on the basis of data obtained from two
forms of the Otis Self-Administering Test of Mental Ability, divided
into one and one-half minute time intervals, and on the basis of
additional spelling test data, concluded that when the test material
was accurately calibrated, the Spearman-Brown formula furnished
excellent prediction.

Kelley [60], using data published by Gordon [41] on the reliability
of judgements of lifted weights, concluded that in this context the
Spearman-Brown formula predicted with a fair measure of accuracy.

Ruch, Ackerson, and Jackson [90] conducted an empirical study of
the Spearman-Brown formula using spelling test material. They
concluded that when homogeneous test material is used, yielding equal
standard deviation and equal reliability of component tests, the
Spearman-Brown formula gave meaningful prediction.

Remmers, Shock, and Kelley [82] studied the application of the
Spearman-Brown formula in predicting the reliability of any given
number of judgements. They concluded that this formula predicted
to within two probable errors the reliability obtained by experiment
up to and including thirteen judges, the limit of their data. Remmers
[83] reported that there was some evidence, although this evidence was
not conclusive, to indicate that in the majority of situations in which
subjective judgements were used, the Spearman-Brown formula
indicated the number of judgements required for a given reliability.

Jordan [56] concluded that reliability coefficients secured by
correlating odd and even items were higher than those secured by
correlating scores on duplicate forms. He argued that the reliability
coefficients derived by correlating the odd and even items were
probably better measures of the reliability of the test because pupil
variability was eliminated.

Remmers and Whistler [84] reported findings similar to those of
Jordan, and discussed the influence of using reliability coefficients
calculated by different methods in formulae that involve a measure of
test reliability, such as the formula for correcting a correlation
coefficient for attenuation.

Ferguson [30] presented data which indicated that the correlations
between the split-halves of tests given on the same day were higher
than those between the split-halves of tests given on different days,
and carried out a factorial analysis in which the differences between
the correlation coefficients were isolated as factors.

Wherry [124] contended that the Spearman-Brown formula, when 
used to estimate the reliability of a test, yielded results that contained 
constant and chance errors, and derived a correction formula. Although 
certain evidence was presented to support the use of this correction 
formula, the logical presumptions of its derivation are uncertain. 

Other writers on the Spearman-Brown formula are Lanier [64], 
Slocombe [99, 100], and Thurstone [115]. 

The experiments carried out on the Spearman-Brown formula may 
be classified into two groups: (1) those that are concerned with pre- 
dicting the reliability of a test lengthened any number of times, and 
in which no assumptions regarding the splitting of a test are made, and 
(2) those that are concerned with the use of the formula in estimating 
the reliability of a test from the correlation between the split-halves. 
The experiments of the first group indicate that if the assumptions are 
satisfied, the formula will yield accurate prediction, which of course it 
must. These assumptions are, firstly, that the standard deviations of 
component tests are equal, and, secondly, that all the intercorrelations 
between component tests are equal. The majority of investigations 
have concerned themselves with reliability coefficients estimated by 
this formula, rather than with the problem of determining whether the 
conditions for its valid use were satisfied. 
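
As a minimal sketch of the relation under discussion, the Spearman-Brown formula in its usual form gives the reliability of a test lengthened k times as kr / (1 + (k - 1)r), where r is the reliability of the original test. The function names below are illustrative only, and the code simply assumes that the two conditions just stated (equal standard deviations and equal intercorrelations of the component tests) are satisfied; it does nothing to check them.

```python
def spearman_brown(r, k):
    """Predicted reliability of a test lengthened k times.

    r: reliability of the original test (or of one component).
    k: lengthening factor; k = 2 is the split-half case.
    """
    return k * r / (1 + (k - 1) * r)


def lengthening_factor(r, target):
    """Number of times a test of reliability r must be lengthened
    (or the number of judgements pooled) to reach the target reliability.
    Obtained by solving the Spearman-Brown formula for k."""
    return target * (1 - r) / (r * (1 - target))


if __name__ == "__main__":
    # Hypothetical split-half correlation of .60 stepped up to full length.
    print(spearman_brown(0.60, 2))
    # How many times must that half-test be lengthened to reach .75?
    print(lengthening_factor(0.60, 0.75))
```

The second function corresponds to the use of the formula, noted above in connection with subjective judgements, to indicate the number of judgements required for a given reliability.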

In using the Spearman-Brown formula to determine the reliability 
of a test by boosting the correlation between the split-halves, the 
empirical evidence supports the conclusion that such coefficients are 
usually, although not always, higher than reliability coefficients ob- 
tained by correlating equivalent forms. This is due to no intrinsic 
fault in the Spearman-Brown formula, but rather to the process of 
splitting the test. The so-called split-half reliability coefficients are 
measures of the internal consistency of tests, and such coefficients do 
not always bear a one to one relationship to reliability coefficients 
obtained by administering equivalent forms. 


The Standard Error of the Spearman-Brown Formula 

A formula for the standard error of the Spearman-Brown formula 
was first published by Shen [94]. The publication of this paper was
followed by a discussion on the correctness of Shen's formula, and a 
number of alternative formulae were developed. Papers were contri- 
buted to this discussion by Holzinger and Clayton [46], Shen [95], 
Douglass [23, 24], and Holzinger [47]. 
Shen's standard error formula requires in its derivation the formula

$$\sigma_r = \frac{1 - r^2}{\sqrt{N}}$$

Since the distribution of r for high values of ρ is very decidedly skewed,
the above formula cannot be used in tests of significance or in
estimation in the usual way for a correlation coefficient of high magnitude
without serious error. Since the great majority of reliability
coefficients, which involve the Spearman-Brown formula in their
computation, are of the order .8 and .9, it is clear that Shen's formula will
yield very inaccurate estimates of the required confidence intervals.
It is questionable whether the literature on the standard error of the
Spearman-Brown prophecy formula is of more than historical interest.
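
The effect of this skewness may be illustrated numerically. The sketch below compares a symmetric interval built from the formula (1 - r²)/√N with an interval obtained through Fisher's z transformation. The latter is brought in here purely for comparison and is not part of the literature reviewed above; the values r = .90 and N = 25 are hypothetical, and the function names are illustrative.

```python
import math


def symmetric_interval(r, n, z_crit=1.96):
    """Interval r +/- z_crit * (1 - r^2) / sqrt(n).
    Treats the sampling distribution of r as symmetric."""
    se = (1 - r ** 2) / math.sqrt(n)
    return r - z_crit * se, r + z_crit * se


def fisher_interval(r, n, z_crit=1.96):
    """Interval computed on Fisher's z = atanh(r) scale and
    transformed back; assumes a bivariate normal population."""
    z = math.atanh(r)
    se_z = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se_z), math.tanh(z + z_crit * se_z)


if __name__ == "__main__":
    r, n = 0.90, 25  # hypothetical reliability coefficient and sample size
    print("symmetric:", symmetric_interval(r, n))
    print("Fisher z: ", fisher_interval(r, n))
```

For a coefficient of this magnitude the second interval extends considerably further below r than above it, which is the asymmetry the symmetric formula cannot represent.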


The Kuder-Richardson Formulae 


Kuder and Richardson [63] contributed an interesting development 
to reliability theory by deriving formulae for the estimation of relia- 
bility coefficients from statistics commonly computed in the selection 
of test items. This method is described as the method of rational 
equivalence. In the calculation of reliability coefficients by one of 
their more useful formulae, referred to as formula 20, the information 
required is the number of items, the test variance, and the sum of the 
item variances. On the assumption that all the items are of the same 
difficulty, a further formula is obtained which requires only the mean,
the standard deviation, and the number of items to estimate a
coefficient of reliability.

The Kuder-Richardson method of estimating the reliability of tests
yields measures of internal consistency rather than measures of
reliability, if reliability is understood in the test-retest sense. Coefficients
estimated by the Kuder-Richardson formula 20 are superior to
coefficients obtained by the split-half method, because any error due to a
bias in splitting a test is eliminated. Ferguson [31] and Dressel [25]
furnished different derivations of the Kuder-Richardson formula 20.
Casanova [10] provided a variant of this formula adapted to
computational purposes.
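
The two formulae described above may be sketched as follows. The code assumes items scored 0 or 1, so that each item variance is pq, and uses variances computed with divisor N throughout; the function names and the small data set are hypothetical. Formula 20 is taken in its usual form, n/(n - 1) times (1 - Σpq/σ²), and the equal-difficulty formula (commonly numbered 21) as n/(n - 1) times (1 - M(n - M)/(nσ²)).

```python
def kr20(scores):
    """Kuder-Richardson formula 20.

    scores: list of rows, one per examinee, each a list of 0/1 item scores.
    Needs the number of items, the test variance, and the sum of the
    item variances (p * q for each item).
    """
    n_persons = len(scores)
    n_items = len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / n_persons
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


def kr21(n_items, mean_total, var_total):
    """Equal-difficulty formula: needs only the mean, the variance
    (squared standard deviation), and the number of items."""
    ratio = mean_total * (n_items - mean_total) / (n_items * var_total)
    return (n_items / (n_items - 1)) * (1 - ratio)


if __name__ == "__main__":
    # Hypothetical data: 6 examinees, 4 items, every item passed by half.
    data = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 0],
    ]
    print(kr20(data))
    print(kr21(4, 2.0, 10 / 6))  # mean 2, variance 10/6 for these data
```

Because every item in the illustrative data has the same difficulty, the two functions return the same value; with items of unequal difficulty the second formula gives the lower figure.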


Battery Reliability 

The reliability of test batteries was studied by Douglass and
Cozens [22]. They pointed out that in computing the reliability of test
batteries Spearman's formula for the correlation of sums should be
used, unless the sub-tests are quite similar in measuring capacity; that
is, unless all sub-tests are regarded as parallel forms. Variations of the
Spearman formula for different weighting systems are given.

Thomson [111] developed the theory of maximum battery relia- 
bility. He employed Hotelling's [49] solution to the general problem 
of obtaining weights that would maximize the correlation between two 
sets of variates to the specific problem of obtaining weights that would 
maximize the reliability of a test battery. He discussed also the rela- 
tionship between maximum prediction of a criterion and maximum 
reliability. 


Analysis of Variance 

Jackson [52] applied analysis of variance methods and the methods 
of testing statistical hypotheses developed by Neyman and Pearson to 
the problem of determining the reliability of tests. In this paper 
methods are developed for treating four different problems: (1) the 
determination of the existence of a significant practice effect, (2) the 
determination of whether or not the test measures the capacity of the 
individuals tested, (3) the estimation of practice effect if it is found 
to exist, and (4) the estimation of the relative importance of the random 
errors of measurement with respect to the true measurement of the 
capacity of the individual. A new statistic termed the sensitivity of a 
test is introduced, denoted by the letter y, and defined as the ratio of 
the standard deviation of true scores to the standard deviation of the 
distribution of errors of measurement. This sensitivity ratio is more 
informative than the usual reliability coefficient as a statistic descriptive
of test efficiency, firstly because it is easier to interpret, and secondly
because as the ratio of two standard deviations it exists on a
scale in which the units are equal. 
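
The connection between the sensitivity ratio and the usual reliability coefficient may be sketched as follows. The sketch assumes that the reliability coefficient is the ratio of true-score variance to the sum of true-score and error variance, with errors uncorrelated with true scores; on that assumption the reliability equals s²/(1 + s²), where s is the sensitivity. This relation follows from the definitions given above and is not quoted from the paper reviewed; the function names are illustrative.

```python
import math


def sensitivity_from_reliability(r):
    """Ratio of the standard deviation of true scores to the standard
    deviation of errors, given a reliability coefficient r (0 <= r < 1)."""
    return math.sqrt(r / (1 - r))


def reliability_from_sensitivity(s):
    """Reliability coefficient implied by a sensitivity ratio s."""
    return s * s / (1 + s * s)


if __name__ == "__main__":
    for r in (0.50, 0.80, 0.90, 0.95):
        print(r, round(sensitivity_from_reliability(r), 3))
```

Equal steps in the sensitivity correspond to very unequal steps in the reliability coefficient near its upper limit, which bears on the remark that the ratio exists on a scale in which the units are equal.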

Jackson [53] applied the analysis of variance and covariance to 
problems of determining the effect of combining data from different 
classes on the estimates of reliability, and the conditions which must 
be satisfied before such results may be combined. 

A very significant contribution to reliability theory was made by 
Hoyt [51] who developed a formula for estimating the reliability of a 
test by analysis of variance methods. Hoyt showed that for any par- 
ticular test by subtracting the sum of squares among individuals and 
among items from the total sum of squares a residual sum of squares 
is obtained which may be used to estimate the discrepancy between 
the obtained variance and the true variance. These data may be used 
to estimate the reliability of a test. It is probable that Hoyt's work 
will permit of further interesting development. 
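
The partition of the sums of squares just described may be sketched as follows. The sketch assumes a complete table of item scores with one row per individual, and takes the estimate in the form commonly attributed to Hoyt: one minus the ratio of the residual mean square to the mean square among individuals. The function name and the data are hypothetical.

```python
def hoyt_reliability(scores):
    """Reliability estimated by analysis of variance.

    scores: list of rows (individuals), each a list of item scores.
    The residual sum of squares is the total sum of squares less the
    sums of squares among individuals and among items.
    """
    n = len(scores)          # individuals
    k = len(scores[0])       # items
    grand = sum(sum(row) for row in scores)
    correction = grand ** 2 / (n * k)

    ss_total = sum(x * x for row in scores for x in row) - correction
    ss_persons = sum(sum(row) ** 2 for row in scores) / k - correction
    ss_items = sum(
        sum(row[j] for row in scores) ** 2 for j in range(k)
    ) / n - correction
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return (ms_persons - ms_residual) / ms_persons


if __name__ == "__main__":
    # Hypothetical 0/1 item scores for 6 individuals on 4 items.
    data = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
        [0, 0, 0, 0],
    ]
    print(hoyt_reliability(data))
```

For items scored 0 or 1 this estimate agrees numerically with the Kuder-Richardson formula 20, which is consistent with both being indices of internal consistency.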
Factors Influencing the Reliability of Tests

The reliability of a test is determined by a wide variety of causes.
Symonds [103] listed 25 factors which influence the reliability of tests. 
Of these factors a number have been subject to specific study including 
the following: (1) the difficulty of the test items, (2) number of re- 
sponses in items of the multiple-response type, (3) practice effect, 
(4) function fluctuation, (5) the variability of the group tested. 


Influence of Item Difficulty on Test Reliability


The influence of the difficulty of test items on test reliability has 
been studied by Symonds [104] and Thurstone [117]. Symonds pre- 
sented convincing argument to show that a test made up of items of 
.5 difficulty value measured an individual most accurately, and that 
the best test for measuring a school grade was made up of items that 
could be answered with 50 per cent accuracy by the average individual. 
Thurstone reported an investigation on the relationship between the 
diagnostic value of a test and the difficulty values of the items com- 
posing it, the diagnostic value being defined as the correlation between 
scores on sub-tests containing items at different levels of difficulty with 
total scores summed over all sub-tests. The conclusion was that the 
diagnostic value of a test, and, therefore, its reliability, was a maximum
when the items were of about 50 per cent difficulty. The diagnostic
value was found to decrease when the difficulty of the items
departed from this 50 per cent level.


Influence of Number of Item Responses on Test Reliability

Workers in the field of test construction have long realized that
increasing the number of alternative responses in items of the
multiple-response type increased the reliability of the test, the argument
being that the reliability was increased by decreasing the probability
of making a score by guessing alone.

Asker [4] approached this problem from the point of view of
elementary probability. Sims and Knox [97] studied the reliability of
multiple-response tests when presented orally, and found with their
data that a test composed of four-response items presented orally was
more reliable than a test composed either of three or five responses
presented orally or five responses presented visually.

Remmers and others [84, 85, 19] formulated the hypothesis that
increase in the reliability of tests with increase in the number of
responses was a function of the Spearman-Brown prophecy formula,
the argument being that doubling, for example, the number of
responses was equivalent to doubling the length of the test. This
hypothesis was tested on data of different types. Some results were
inconclusive; others seemed to corroborate the hypothesis. The present
writers feel that no intrinsic one to one relationship exists between the 
lengthening of a test and increasing the number of item responses, 
although such an hypothesis may in practice frequently describe the 
data. 

Ferguson [31] derived formulae based on certain probability con- 
siderations, whereby the increase in reliability with increase in the 
number of alternative responses could be estimated. The assumptions 
upon which these formulae are based may frequently not be satisfied 
in practice. These formulae represent a rationalization of the problem. 


Influence of Practice Effect on Test Reliability 

Reliability coefficients calculated after different amounts of prac- 
tice have been published by Gates [39], Gundlach [13], Slocombe [101], 
and Anastasi [3]. Anastasi pointed out that practice increased the 
effectual length of the test; consequently the item difficulty values 
change. Since the reliability of a test is a function of the difficulty 
values of the items, the reliability being a maximum when all items are 
of .5 difficulty, any factor which influences the difficulty values will
influence the reliability.


Test Reliability and Function Fluctuation 

The variability of cognitive function, sometimes termed function 
fluctuation, is a factor which may influence the magnitude of 
reliability coefficients calculated by administering equivalent forms 
of a test with a time interval between the two administrations. 
Thouless [114] derived an index for the measurement of function fluc- 
tuation. Paulsen [79] suggested that functional variability was re- 
sponsible for the discrepancy between reliability coefficients calculated 
by the split-half method and coefficients obtained by correlating equi- 
valent forms with a time interval between the two administrations. 
He suggested that the correlation between equivalent forms could be 
corrected for attenuation, using the split-half reliabilities in the
denominator of the attenuation formula, and the coefficient thus corrected
used as a coefficient of trait variability. 
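
The correction described in the preceding paragraph may be sketched as follows, taking the correction for attenuation in its usual form: the observed correlation divided by the square root of the product of the two reliabilities. The function name and the numerical values are hypothetical.

```python
import math


def correct_for_attenuation(r_forms, rel_a, rel_b):
    """Correlation between equivalent forms A and B, corrected for
    attenuation.

    r_forms: observed correlation between the two forms, given with a
             time interval between administrations.
    rel_a, rel_b: split-half (stepped-up) reliabilities of forms A and B,
                  used in the denominator.
    """
    return r_forms / math.sqrt(rel_a * rel_b)


if __name__ == "__main__":
    # Hypothetical values: forms correlate .72 across the interval,
    # while each form has a split-half reliability of .81.
    print(correct_for_attenuation(0.72, 0.81, 0.81))
```

A corrected value appreciably below unity would, on this suggestion, be read as evidence of variability in the trait over the interval rather than of error within a single sitting.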

With verbal test material function fluctuation in the experience 
of the present writers seems to have very little influence on the magni- 
tude of the reliability coefficient calculated by correlating equivalent 
forms of a test after a time interval. With other types of material its 
influence on the reliability coefficient may be more pronounced.
Furthermore the discrepancy between split-half reliability coefficients and
coefficients calculated by correlating equivalent forms is attributable 
largely to factors other than function fluctuation. 


Influence of Variability of Group Tested on Reliability Coefficient 
The magnitude of a reliability coefficient is a function of the hetero- 
geneity of the group tested. Kelley [57] developed a formula for deter- 
mining the reliability of a test in one range of ability given its relia- 
bility in another range of ability. Cureton and Dunlap [14] published 
a nomograph to facilitate the use of this formula. Rulon [91] published
a graph to serve the same purpose. Dunlap and Cureton [27] devel- 
oped a formula for determining the standard error of a reliability co- 
efficient estimated from a coefficient for a different range of ability. 
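
The dependence of the coefficient on the heterogeneity of the group may be sketched as follows. The sketch takes the formula in the form usually attributed to Kelley, which rests on the assumption that the error variance of the test is the same in both ranges of ability, so that σ₁²(1 - r₁) = σ₂²(1 - r₂). The function name and the numerical values are hypothetical.

```python
def reliability_in_new_range(r_known, sd_known, sd_new):
    """Reliability expected in a group with standard deviation sd_new,
    given reliability r_known in a group with standard deviation sd_known.

    Assumes the error variance of the test is equal in the two groups.
    """
    error_variance = sd_known ** 2 * (1 - r_known)
    return 1 - error_variance / sd_new ** 2


if __name__ == "__main__":
    # Hypothetical: reliability .90 in a group with S.D. 10;
    # what is expected in a more homogeneous group with S.D. 5?
    print(reliability_in_new_range(0.90, 10.0, 5.0))
```

With these illustrative figures a test that appears highly reliable in the wider group is much less so in the narrower one, although the errors of measurement themselves are unchanged.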


Further Research on Test Reliability 

Despite the fact that investigations into the reliability of tests, and 
closely associated topics, have been comprehensive in character, there 
is every indication that numerous significant developments remain to 
be made in this field. 

Firstly, more rationalization of problems which have hitherto been 
approached purely by empirical methods is required. An example of 
this type of development may be cited. The reliability of a test is a 
function of the variance of scores relative to a defined population. 
This variance is in turn a function of the error variance, the variance 
of difficulty values, and the mean test score. Thus if the variance of 
difficulty values is decreased, the variance of scores in the defined 
population is increased; consequently the reliability of the test is in- 
creased. Test makers have for some time been aware of the existence 
of a functional relationship between the reliability of a test, the vari- 
ance of test scores, the variance of difficulty values, and the mean test 
score, but the precise nature of this apparently complex relationship 
was unknown. Very recently, however, equations were derived in our 
Research Department which showed the precise nature of this rela- 
tionship; hence we are now able to estimate the changes that will 
occur in the variance of scores and in the reliability when by the 
removal or addition of certain items the mean score is changed, or 
the variance of difficulty values is changed, or both are changed 
simultaneously. While the solution of such small problems as this
may in itself be of no great importance, such developments
are essential in the attainment of a suitable rationale underlying the
methodology of test construction. 

The work of Kuder and Richardson [63], and of Hoyt [51] has given 
rise to a number of problems which require investigation. The co- 
efficients derived by the methods suggested by these writers are indices 
descriptive of the internal consistency of tests. The relationship be- 
tween such coefficients and the reliability of tests obtained by test- 
retest methods requires to be established. In experimenting on this 
problem it is not sufficient to design experiments merely to show the 
relationship between the two types of coefficients. Care must be taken 
to determine the causes of any discrepancy. 

Some research may profitably be carried out on the relationship
between various methods of item selection and test reliability, with a
view to finding a method which, while not too laborious arithmetically,
will select the most reliable battery of items.

The reliability of different types of test material requires further 
investigation. The determination of an error variance which is in 
large measure a characteristic of the type of test material, and inde- 
pendent either of the variability of the group or of the differences in 
difficulty between items, should now be possible by analysis of variance 
methods. 

The influence of functional variability on reliability coefficients 
determined by test-retest methods has not yet received adequate 
attention. Such variability may be a characteristic of the type of test 
material used. 

The above short summary includes only a few of the possible 
aspects of test reliability which remain to be investigated. Some of 
these problems are at present being studied in the Department of 
Educational Research, University of Toronto, and the results will be
published in the near future.
CHAPTER II


THE CONCEPT OF RELIABILITY 


Measurement is as important in education as in the other sciences. 
In many, although not all, fields of scientific enquiry experiments are 
designed to demonstrate the truth or falsity of some hypothesis. 
Measurement of the character or characters of the population specified 
by the hypothesis is an essential part of the experiment. An analysis 
of the experimental results will, if the experiment has been designed 
properly, generally enable us to state whether or not the hypothesis is 
true—at least in the particular situation we have chosen to consider. 
Our statement concerning the truth or falsity of the hypothesis is based 
on a comparison of the experimental results with the results to be 
expected if the hypothesis is true. It will be seen that, even if we 
assume the experiment has been designed correctly and adequately 
controlled, the validity of our statement concerning the truth or falsity 
of the hypothesis will depend on the accuracy of the measurements. 

We are, therefore, largely at the mercy of the measuring instrument
we choose to employ. Tests or examinations are our measuring
instruments in the field of education. There are many kinds of tests and
examinations but they have one feature in common: namely, that they 
are designed to measure some ability or capacity of individuals or 
groups of individuals. Since, in this study, we are interested in
tests only as measuring instruments, it is immaterial for our purpose 
whether the test is, for example, an essay examination, a new-type 
achievement test or an intelligence test. We are concerned with their 
general, not their specific, purpose, and shall speak of “tests” in the 
general sense of the term. The results given here are, of course, appli- 
cable to all types of tests. 

If we knew the true value of a character there would be no need for
measurement. It follows that when we admit the necessity of
measuring some character we admit that we do not know the true value.
It is not always made clear that the measurement obtained is not
necessarily exactly equal to the true value. The value we obtain is
only an estimate, and as such is subject to error. A measuring
instrument is not perfect, and, when we use it in measuring, we make errors.

Let us assume, for example, that we are measuring the length of a
piece of wood with a yard-stick. If we make several independent
measurements of the object, the values obtained may not differ very 
much but they will not all be the same. If we repeat the measure- 
ments, using another yard-stick, we shall probably obtain very similar 
results. It may happen, however, that our second set of measure- 
ments are on the average larger than the first. Perhaps our second 
yard-stick has been graduated incorrectly, or has shrunk. In this case 
we should say that our second set of measurements is biased, i.e. there 
is a constant error in addition to the usual random errors of measure- 
ment. This bias effect is not usually classed as an error of measure- 
ment, but, from the point of view of estimating true values, it is an 
error effect. These two effects, bias and random errors, are generally 
independent of each other and will be treated as such in the following 
discussion. 
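
The distinction drawn in this paragraph may be set down symbolically. The notation is introduced here only for illustration and is not the authors': τ denotes the true length, b_j the constant error (bias) of the j-th yard-stick, and e_{jk} the random error of the k-th measurement made with it, assumed to average zero and to be independent of the bias.

```latex
% Illustrative notation only (not taken from the text):
%   \tau    true value of the character measured
%   b_j     constant error (bias) of instrument j
%   e_{jk}  random error of the k-th measurement with instrument j
\[
  x_{jk} = \tau + b_j + e_{jk}, \qquad \mathrm{E}(e_{jk}) = 0 ,
\]
\[
  \bar{x}_{j} = \frac{1}{n}\sum_{k=1}^{n} x_{jk}
  \quad\text{estimates}\quad \tau + b_j .
\]
```

Repeating the measurements reduces the influence of the random errors on the mean but leaves the bias untouched, which is why the two effects are treated separately.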

A particular measuring instrument is not necessarily equally ap- 
propriate for use in all situations. If we were measuring a group of 
objects which differed in length by as much as six inches, for example, 
an ordinary yard-stick would probably be considered a satisfactory 
instrument. It would not be satisfactory, however, if our objects 
differed in length by not more than one-tenth of an inch. The position 
here is different from the one considered previously in that we are 
using the measurements as estimates of the true lengths of the objects 
and also to distinguish between the objects. An instrument which is 
satisfactory for one purpose may be of little value for another, or inap- 
propriate in a different situation. We can, therefore, judge whether 
or not a measuring instrument is satisfactory only when we know the 
purpose for which it is to be used, and the conditions under which it is 
to be used. 

The tests used as measuring instruments in education and psychology
are generally more inaccurate than the instruments used in
other sciences. In planning experiments and in interpreting
experimental results in this field, therefore, a knowledge of the accuracy or
inaccuracy with which our instruments measure is essential. The
ideas discussed above in relation to physical measurements seem to
the writers to be fundamental and will be applied to the problems of
mental measurement in the following discussion. There is not an
exact analogy, of course, but some of the basic concepts are common
to both fields. Additional problems enter into mental measurement:
for example, the difficulty of graduating our instruments and the
possibility that the objects we measure (i.e. the individuals) may change,
at least with regard to our test, either between measurements or while
being measured, or may be changed by the very process of measuring.
We find also that a particular test is not necessarily equally satisfactory 
for measuring individuals in different groups; for example, individuals 
of different chronological ages or in different grades. In physical 
measurements, on the other hand, we can always say that a particular 
instrument measures with a fixed degree of accuracy irrespective of 
the group of objects measured. The difference seems to be due partly 
to the fact that tests are designed to measure ability at a particular 
stage in the growth of the individual. This limitation is introduced 
in the construction of the tests. Short, easily administered group tests 
are demanded, and, indeed, are the only type suitable for the ordinary 
testing programme. They, however, sample only a limited range of 
ability. We may, as is done in individual tests, arrange a series of 
tests which will cover a number of such stages but this introduces 
difficulties in the construction and use of the test and we must still 
determine how accurately it measures at different points. It is clear 
that under these circumstances the tests (group or individual) will be 
satisfactory measuring instruments for only a fairly well-defined range 
of ability. Our attention will be confined mainly to group tests but 
it is obvious that the results are, with a slightly different interpretation, 
equally applicable to individual tests. 

Tests, like other instruments, may be used for many purposes.
Whatever the specific purpose, the scores obtained on the test are used
either as

(1) estimates of the true scores of the individuals,
or
(2) estimates of the ability of individuals relative to that of the
others in the group, i.e. used in estimating the relative ability
of individuals and in distinguishing between individuals or
groups of individuals.

These categories are not altogether independent, and we may
sometimes use the same scores for both purposes [...] the accuracy or
adequacy of tests. [...] problems of measuring height and
intelligence [...] convenience of the division.

Let us consider, first, the problem of measuring the height and
intelligence of an individual or, more exactly, of estimating his true
height and intelligence. It is sufficient for ordinary purposes if we
know an individual's height to the nearest inch and whether he is of
normal, above normal or sub-normal intelligence. There is, of course,
always the difficulty of deciding border-line cases, but in most instances
an ordinary measuring instrument, such as a yard-stick for measuring 
height and a group test for measuring intelligence, would probably be 
considered adequate. If a particular level of accuracy is desired and 
specified, for example if the height is to be measured to the nearest 
tenth of an inch, we must choose an instrument which satisfies this 
requirement. It is most important in these cases to make certain that 
our measurements are not biased, or, if they are biased, that the 
amount of bias is known in order that a correction may be applied to 
the original measurements. Otherwise, the measurements and the 
magnitude of the errors of measurement will be all the information 
which it is necessary to give. When we say, for example, that the 
height of an individual is 5 feet, 10 inches and his I.Q. rating is 130
points, we do not mean that his height is exactly 5 feet, 10 inches or
that he has an I.Q. of exactly 130 points. We mean that his height
is approximately 5 feet, 10 inches and his I.Q. is approximately
130 points. These may be exactly equal to the true values, they
may be too high or they may be too low; we simply do not know
which is the case. If we know the magnitude of the errors of
measurement, however, we can specify a certain range, for example
115 to 145 I.Q. points, and state that this range will cover the true
value. In making this statement we know that we shall be correct in
a certain fixed proportion of cases, say in 99 out of 100 cases, depending
on the degree of accuracy desired. The magnitude of the errors of
measurement, or rather a comparison of these errors with the degree
of accuracy required, will determine whether or not a test is adequate
and satisfactory in a particular situation.

The position is very different when we consider the problem of
distinguishing between individuals or groups of individuals. If, for
example, the individuals differ in height by not more than one-tenth
of an inch, then an ordinary yard-stick graduated in inches and
quarters of an inch would not be an adequate or satisfactory instrument
to use. We would not be able to rank the individuals according to
height with any degree of confidence. Similarly, when we are
comparing two or more groups we can judge whether our measuring
instrument is adequate or satisfactory only if we know, or can determine,
the size of the differences between the groups. If these are large, then
even an inaccurate instrument might be satisfactory for the purpose
of distinguishing between the groups although it would be of little
value for the purpose of distinguishing between individuals in the same
group. A test need not be very accurate, for example, if all we require
is that it will enable us to distinguish between a group of morons and
a group of individuals of very high intelligence. If, on the other hand, 
our groups are more nearly equal in ability, our test may not be 
accurate enough to enable us to detect the difference between them. 
In this case the test would not be satisfactory for the purpose of dis- 
tinguishing between either the groups or the individuals within the 
groups. If our groups are heterogeneous and the difference between 
them is small, it may happen that our test is accurate enough to enable 
us to distinguish satisfactorily between individuals within the groups 
but is not accurate enough to enable us to distinguish between the 
groups. In such a case we should probably conclude that there is no
difference between the groups, but it would be more exact to say that
there may be a difference but our test is not sufficiently accurate to
enable us to detect it.


The question of bias is not so important in 
problems of this kind. If we may assume, as we generally do, that 
this effect is constant for all individuals, then our comparison between 
individuals or groups of individuals will be unaffected. It is important 
only when we are interested in estimates of the true scores of the indi- 
viduals or groups of individuals. 
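
The effect of a constant bias can be checked with a small numerical
sketch. The three heights and the bias of 3 inches are hypothetical
values chosen for illustration, and the function names are our own.

```python
# Hypothetical true heights (inches) for three individuals; not data from the text.
true_values = [68.0, 70.0, 71.5]
BIAS = 3.0  # assumed constant bias, identical for every individual


def apply_bias(values, bias):
    """Return measurements that all carry the same constant bias."""
    return [v + bias for v in values]


def pairwise_differences(values):
    """Differences between every pair of individuals (earlier minus later)."""
    return [a - b for i, a in enumerate(values) for b in values[i + 1:]]


def ranking(values):
    """Indices of the individuals ordered from lowest to highest value."""
    return sorted(range(len(values)), key=lambda i: values[i])


biased = apply_bias(true_values, BIAS)

# Comparisons between individuals are unaffected by the constant bias ...
same_differences = pairwise_differences(biased) == pairwise_differences(true_values)
same_ranking = ranking(biased) == ranking(true_values)

# ... but every estimate of an individual's true value is wrong by the bias.
estimate_errors = [b - t for b, t in zip(biased, true_values)]
```

Here the differences and the ranking are unchanged, while each
individual estimate is too high by the full amount of the bias.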

We are interested, therefore, in both the magnitude of the errors 
of measurement and the relation between the size of the errors and the 
size of the differences between the objects measured. In other words, 
we are interested in both the absolute and the relative accuracy of our 
measurements. There is also the question of bias, i.e. the constant
correction to be applied to our measurements.


Definition of Reliability 


The term “reliability” is used in psychological and educational 
work, and we customarily speak of the “reliability of a test” or other 
measuring instrument. According to Walker [120], this was introduced
in 1910 by Spearman, who defined the term “reliability coefficient”
as the (correlation) “coefficient between one-half and the other
half of several measurements of the same thing.” In a later work [108],
Spearman defines reliability as follows:

“... reliability; this means the amount of correlation between
two or more ratings of the same kind.”

In the beginning, therefore, reliability and correlation were
connected, and this connection has not yet been broken. Since the
interpretation of correlation coefficients is rather difficult, this connection
has not been a happy one and has tended to confuse rather than clarify
the issue. The difficulty is that the correlation coefficient is a measure
of the degree of relationship between two (or more) variables, i.e. a
measure of the agreement, whereas in determining the accuracy of our 
measurements we are more interested in the disagreement of the results, 
i.e. in a measure of the departure from a perfect relationship. These 
measures are related, of course, but the connection with correlation has 
meant an indirect rather than a direct approach to the problem. 
This confusion can best be illustrated by a comparison of some of
the definitions of reliability in common use. One of the definitions
widely used is that given by Otis [78]:

“The term reliability is used technically in connection with tests
to mean the degree to which a test is consistent in measuring that
which it measures.”

A criticism of this definition is that it defines reliability in terms of
consistency, which does not help very much as the term “consistent”
or “consistency” must also be defined. The definition given by McCall
[70] is similar:

“By reliability of a test is meant the amount of agreement between
results secured from two or more applications of a test to the same
pupils by the same examiner.”

The difficulty here is to determine what is meant by “amount of
agreement”; this seems to be similar to the idea of the “degree of
relationship” which underlies correlation.

In contrast to these, we find that Kelley [61] gives the following
definition:

“... the question of reliability is that of how accurately a test
measures the thing which it does measure.”

Sandiford [93] gives a similar definition:

“By reliability is meant the accuracy with which the test measures
whatever it does measure. It is, therefore, synonymous with
accuracy in measurement.”


The definition given by Monroe [71] relates reliability directly to the
errors of measurement:

“The reliability of a test refers to the magnitude of the differences
between the obtained scores and the true scores. These differences
are the variable errors of measurement.”

Thurstone [116] gives a similar definition:

“A test that is subject to relatively small chance factors in its
score is said to be reliable while a test with considerable variation
from one occasion to another is said to be unreliable.”

Here again the difficulty is to determine what is meant by “relatively
small” and “considerable”. Later, on page 3, Thurstone [116]
distinguishes between the “errors of measurement” and the “relative
stability of the scores”, and says that the reliability coefficient describes
the relative stability of the scores.

It seems clear that there is some disagreement among the various 
authorities on the definition of reliability. On the one hand, we find 
the emphasis placed on the accuracy of the test or the errors of mea- 
surement, and, on the other hand, the emphasis is placed on the con- 
sistency of the measurements or the relative stability of the scores. 
The necessity of considering the relation between the errors of mea- 
surement and the size of the differences between the individuals tested 
has been stressed by Franzen and Derryberry [36] and Jackson [52]. 
The latter author introduces the term “sensitivity” of a test, and sug- 
gests a different statistical method to be used in analysing the experi- 
mental results. (This problem will be considered in detail later in this 
Bulletin.) The comparison of the errors of measurement with the 
differences between the individuals tested seems to be a new approach, 
but it may be that this is what the other authors had in mind when 
speaking of consistency and relative stability; it is assumed in the fol- 
lowing discussion that this is the case. 

It is clear that we have two problems here, not just one. We must 
determine (1) the magnitude of the errors of measurement and (2) the 
relation between the size of these errors and the size of the differences 
between the individuals tested. It is suggested that one term, such 
as reliability, is not sufficient and that the position would be clarified 
if we stopped using such a “blanket” term and used instead the terms 
(defined earlier) "absolute" and “relative accuracy" of measurement. 
Another reason for suggesting a change in terminology is that the 
terms reliability and reliable are used (at least on this continent) in 
another and quite different sense. Garrett [38], for example, speaks 
of the “reliability” of the mean, meaning the errors of estimation of 
the mean, and also of the "reliability", i.e. the significance, of the 
difference between two means. These uses of the term introduce the 
concept of errors of sampling; it is not surprising, therefore, that the 
meaning of the term reliability is not clear. 

There is another related concept which may be discussed briefly 
at this point. This is the concept designated by the term "index of 
reliability". According to Walker [120], this was introduced by 
Spearman and Abelson, although Kelley obtained independently the 
same result. The index of reliability (actually the square root of the
reliability coefficient) is an estimate of the correlation between the
obtained scores and the true scores on a test. Theoretically, this is a
useful concept and estimate, but, since we can never know the true
scores, it is of little practical value. For this reason, therefore, no
further discussion will be given of this concept or estimate. It may 
be mentioned that occasionally authors of tests give the estimates of 
the index of reliability when discussing the reliability of their tests. 
As the index of reliability is greater, and sometimes considerably 
greater, than the reliability coefficient, this practice is to be condemned 
as the values quoted give the impression (to those who are unfamiliar 
with this type of work) that the test is a more accurate measuring 
instrument than it actually is. 
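
The size of the discrepancy is easily seen in a short calculation. The
sketch below simply takes the square root of the reliability
coefficient, as stated above; the three coefficients are hypothetical
and the function name is our own.

```python
import math


def index_of_reliability(reliability_coefficient):
    """Square root of the reliability coefficient, as defined in the text."""
    if not 0.0 <= reliability_coefficient <= 1.0:
        raise ValueError("expected a coefficient between 0 and 1")
    return math.sqrt(reliability_coefficient)


# Hypothetical coefficients, chosen only for illustration.
for r in (0.50, 0.64, 0.81):
    print(f"r = {r:.2f}   index of reliability = {index_of_reliability(r):.2f}")
```

A coefficient of 0.64, for example, corresponds to an index of 0.80,
and the lower the coefficient the wider the gap between the two values.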

The position with regard to the problem of reliability may be
summarized as follows: The tests and examinations used in psychology
and education may be considered as measuring instruments. Any
particular test or examination, like any other measuring instrument,
is designed for use in particular situations and under certain
well-defined conditions. If it is used in other situations or under other
conditions, the measurements may be of little or no value. Even in
the most favourable circumstances, however, the test or examination
is not a perfect measuring instrument and in using it we make errors.
It is, of course, essential for us to know the size of these errors, i.e. the
accuracy or inaccuracy with which we can measure. The problem,
however, is complicated by the fact that the measurements, i.e. the
scores, may be used as estimates of the true scores of the individuals
and, also, in distinguishing between individuals or groups of individuals.
It is, therefore, necessary for us to determine the magnitude of these
errors of measurement and also to obtain a measure of the size of the
errors in comparison with the size of the differences in ability of the
individuals between whom we wish to distinguish. It is suggested,
finally, that in order to distinguish between these two aspects of the
problem, we should use the terms “absolute” and “relative” accuracy
of measurements instead of the single blanket term “reliability”.


CHAPTER III


THE MEASUREMENT OF RELIABILITY


[...] or population of cases. Generally, [...] population is considered
to be infinite, or [...] the population is, in this case, determined [...]
this parameter, the constant standard [...] we cannot make an infinite
number of measurements [...] of σ is unknown; what we do is to
estimate σ. We make a certain
number of measurements, and calculate the standard deviation, say S,
of the distribution of these and use this as our estimate of σ. It is clear
that S may not be exactly equal to σ; we never do know exactly the
magnitude of our errors of measurement.

*See also Appendix A.

The small group of measurements which we made, form what is 
termed a sample, i.e. a certain number of all the possible measurements 
which could be made. In experimental work we almost invariably 
work with samples and sample values, not with populations and popu- 
lation values. As mentioned above, however, we sometimes forget 
this and speak of, and use, the sample values of the statistics as though 
they were the true values of the parameters in the population from 
which we are sampling. These sample values are the only values we 
have, of course, so we must use them, but we should always remember 
that they are only estimates of the true values. 

The theory of sampling enables us to take one further step. Using 
the estimates calculated from the sample, we may determine an inter- 
val and make the statement that this interval covers the true value of 
the parameter. The property of this confidence interval, as it is 
called, is that we know we shall be correct in making this statement 
in a certain fixed proportion of such cases [77]. This is, however, as 
far as we can go; we simply do not know, and in most cases cannot 
hope to know, the true value of the parameter in the population from 
which we are sampling. 
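
The property of the confidence interval described above can be
illustrated by a small simulation. This is only a sketch under
simplifying assumptions that are ours, not the text's: the population
is taken to be normal, its standard deviation is treated as known, and
the parameter estimated is the mean. The population values (mean 100,
standard deviation 15) are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean


def interval_for_mean(sample, sigma, confidence=0.95):
    """Interval mean +/- z * sigma / sqrt(n), assuming sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sigma / math.sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width


def coverage(true_mean, sigma, n, trials, confidence=0.95, seed=0):
    """Proportion of simulated samples whose interval covers true_mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = interval_for_mean(sample, sigma, confidence)
        if low <= true_mean <= high:
            hits += 1
    return hits / trials


proportion = coverage(true_mean=100.0, sigma=15.0, n=25, trials=4000)
print(f"proportion of intervals covering the true mean: {proportion:.3f}")
```

For any one sample we cannot say whether the interval covers the true
value; the simulation shows only that the statement is correct in
roughly the stated proportion of samples.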

This idea of estimating the population value from the value cal- 
culated for a sample has an important bearing on the whole problem 
of reliability. In quoting the sample value, we claim, explicitly or 
implicitly, that it is an estimate of the parameter in the population 
from which we are sampling. It follows that we have in mind some 
particular population and, to be consistent, we should specify the 
population to which our estimate refers. Let us assume, for example, 
that an author of a test states that the value of the reliability coefficient 
for his test is 0.9. What are the possible populations to which this 
sample value may refer? This depends, of course, on the group or 
groups from which the sample was drawn. It may be that all the 
children tested were from one grade, or were in one class (possibly 
selected) of pupils in a particular grade, or perhaps the sample included 
all the children in one or more grades in a particular school. We must 
remember also that possibly the school contains pupils drawn only 
from a particular social or racial class, or possibly from all classes, and 
we may have a narrow or broad range of ability. In other cases, of
course, the author may have confined his attention to children of a
particular chronological age-group, ten-year-olds, for example. It will
be realized that the number of possible groups is very large, and unless 
the author gives us some idea of what group or groups were sampled, 
the value he quotes will, at least to a certain extent, be meaningless. 

Further discussion of this point must be postponed until we come 
to consider the experimental and statistical methods used in measuring, 
and the kind of information which it is suggested should be given when 
reporting the reliability or accuracy of a test. It may be noted, how- 
ever, that the population we consider is determined partly by the 
nature of the test, the way in which it is used, and the kind of situations 
in which it is used. In an individual test, for example, it is likely that
a knowledge of how accurately the test measures at a particular
chronological age-level would be most useful; the interest here is centred in
the determination of the ability of the individual, not so much in
distinguishing between individuals, although this is also of interest.
In a group test, on the other hand, which is generally given by a teacher
to all the pupils in a particular class or grade, information regarding
the accuracy with which the test measures and also how well it
distinguishes between the individuals within the group will be necessary.
The statistical method suggested for use in problems of this kind gives
us the answers to all these questions in a single analysis. It does not,
of course, tell us which units we should use, but it does tell us what
happens when we use different units or samples from different
populations.


(1) Experimental Methods 


Most authors report some statistics regarding the accuracy with 
which their test measures, but there are still some who ask us to accept 
their offered measuring instrument on faith. It seems necessary, there- 
fore, to point out that the only way to determine how accurately the 
test measures is to try it out—and this is the task of the author of the 
test. 

In determining the accuracy of a physical measuring instrument, 
the same objects are measured several times and from the differences 
found we may calculate an estimate of the errors of measurement. 
Possibly the easiest and simplest method to use in determining the 
accuracy of a mental measuring instrument is the same, i.e. to repeat 
the test on the same group of individuals after a certain period of time 
has elapsed. In this case, also, we may obtain an estimate of the
errors of measurement from the differences found between the two sets
of measurements. The problem, however, is not as simple in the case
of mental as in physical measurements. In the latter case the objects,
except in the few cases where their destruction is involved, are unaf- 
fected by the measuring process and remain unchanged over a short 
or long period of time. In mental measurements, unfortunately, the 
objects measured—the children—react to the process of measurement 
and in any case change over a period of time, at different rates of 
change. We cannot do much about this, however, since the children 
are alive. It is simply an additional obstacle peculiar to the field of 
measurement whenever living organisms are involved. It is one of the
reasons why this simple test-retest method, as it is called, is not used
in all experiments concerned with the determination of the reliability
of tests.

The second experimental method requires two or more equivalent
forms of the same test. Instead of making just one test, two or more
forms of it are constructed; these are matched, generally item by item,
for difficulty, content, etc. For tests for which alternative forms are
available, we do not repeat the same test but give one of the alternative
forms on the second trial. If the forms are truly equivalent, this
method is to be preferred as it overcomes some of the weaknesses of the
test-retest method. As the alternative forms may not be exactly 
equivalent, however, this procedure may, in certain cases, introduce 
an additional disturbing factor. Since, however, the time elapsing 
between the giving of the two forms of tests may be very short, the 
method may be said to obtain children whose abilities are more or less 
constant on the two trials. In using this method, we again obtain an 
estimate of the errors of measurement from the differences found 
between the two sets of measurement. It will be seen that we assume 
that these differences, except for a constant practice effect, are caused 
entirely by the errors of measurement. It follows that if the two forms 
are not equivalent, these differences will tend to be increased and to 
this extent our estimate of the errors of measurement will be biased. 
We would, in fact, conclude that our test is more inaccurate than it
actually is. It will be agreed, however, that, if err we must, it is better
to overestimate than underestimate the magnitude of the errors of
measurement.

The third method consists simply in giving the test once and from
the scores obtained, or some function of them, estimating how
accurately the test measures. This method seems to overcome all the
weaknesses mentioned above, and from a theoretical point of view, is
probably to be preferred. It will be shown later, however, that in
practice the assumptions underlying the statistical methods used in
analysing the results obtained by the use of this method are not always
satisfied. The validity of our estimate of the accuracy of our test 
depends mainly upon the type of test we have and, to a lesser degree, 
the relative difficulty and position of the individual items. From the 
practical point of view, therefore, this method is probably the weakest 
of all, and it is suggested that it should be used only for cases in which 
neither of the other methods may be employed. 


(2) Statistical Methods 


Most of the statistical methods used in estimating the reliability 
of tests, or the accuracy with which they measure, may be applied 
without change to the results obtained by using any of the ex- 
perimental methods. There are, however, certain methods which 
are developed for use in particular situations. The analysis of variance 
method suggested by Jackson [52], for example, cannot always be used 
in analysing the results obtained from the third experimental method. 
The author suggested that it could be used in all cases, subject to cer- 
tain possible differences in interpretation, but it has since been found 
that some of the assumptions underlying his method are not always
satisfied. This point will be discussed in more detail later; it is
sufficient at this stage to point out that it may be used in some but not all
of these cases.

It is convenient to divide the discussion on these statistical methods 
into three parts and to deal with each one separately: 

(a) methods used in estimating reliability coefficients;
(b) methods used in estimating the errors of measurement;
(c) the method suggested for use by Jackson (see above) in
estimating the sensitivity of a test.


(a) Methods used in estimating reliability coefficients

Although it gives us only an indirect measure of the accuracy of
the test, the reliability coefficient is the most widely used measure of
reliability. It is a correlation coefficient, the coefficient for the [...]
If the test is given only once, it is customary to take the scores on the
odd and even items and use the obtained correlation coefficient as an
estimate of the reliability of either half of the test. The estimate of
the reliability for the whole test is then calculated by using the
well-known Spearman-Brown formula for double length. If we denote by
$X_i$ the score obtained by the i-th individual on the first testing; by
$Y_i$ the score obtained by the same individual on the second testing;
by $\Sigma$ summation over all $N$ values of $i$, then the reliability
coefficient, $r$, may be defined as

$$ r = \frac{\Sigma XY - \dfrac{(\Sigma X)(\Sigma Y)}{N}}{\sqrt{\left[\Sigma X^2 - \dfrac{(\Sigma X)^2}{N}\right]\left[\Sigma Y^2 - \dfrac{(\Sigma Y)^2}{N}\right]}} \qquad (1) $$

If the coefficient so obtained refers to only half the test, then the
coefficient for the whole test is obtained by using the formula

$$ r_{1} = \frac{2\,r_{1/2}}{1 + r_{1/2}} \qquad (2) $$

where $r_{1/2}$ denotes the half-test, and $r_{1}$ the whole test, coefficient,
respectively. It may be noted, however, that under certain conditions
the value obtained by using equation (1) is not the best estimate of the
population value; this particular problem is discussed in detail in
Appendix A of this bulletin. This appears to be another of the cases
in which research workers have accepted a statistical method without
examining critically the assumptions underlying it and comparing
these with the conditions of their own problems.

Kuder and Richardson [63] have suggested another method of
estimating the reliability coefficient. They have shown that, subject
to certain assumptions, we may obtain an estimate of the reliability
coefficient from the results of a single application of the test by using
the following formula

$$ r = \frac{n}{n-1} \cdot \frac{S^2 - \Sigma pq}{S^2} \qquad (3) $$

where $r$ is the reliability coefficient; $n$ the number of items in the test;
$S$ the standard deviation of the distribution of scores obtained on the
test; $p$ the proportion of subjects passing any given item; $q = 1 - p$;
$\Sigma pq$ the sum of the products of $p$ and $q$ for all items in the test. This
is one of the special methods mentioned above and, as will be shown
later, it seems to give a measure of the internal consistency rather than
the reliability of the test in the usual sense of the term.
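
The three calculations described in this part (the correlation between
two sets of scores, the Spearman-Brown formula for double length, and
the Kuder and Richardson formula) may be sketched as follows. The
pass/fail results for 6 subjects on 4 items are hypothetical, the
function names are our own, and the variance in the last function is
computed with divisor N, which is an assumption since the text does
not state the divisor.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation computed from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
    return numerator / denominator


def spearman_brown_double(r_half):
    """Whole-test coefficient from the half-test coefficient."""
    return 2.0 * r_half / (1.0 + r_half)


def kuder_richardson(item_matrix):
    """item_matrix[i][j] is 1 if subject i passed item j, otherwise 0."""
    n_subjects = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n_subjects
    # Variance of total scores with divisor N (assumed; see the lead-in).
    variance = sum((t - mean_total) ** 2 for t in totals) / n_subjects
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_subjects
        sum_pq += p * (1.0 - p)
    return (n_items / (n_items - 1)) * (variance - sum_pq) / variance


# Hypothetical pass/fail results: 6 subjects (rows) by 4 items (columns).
items = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
odd_scores = [row[0] + row[2] for row in items]   # first and third items
even_scores = [row[1] + row[3] for row in items]  # second and fourth items

r_half = pearson_r(odd_scores, even_scores)
r_whole = spearman_brown_double(r_half)
r_kr = kuder_richardson(items)
```

With so few subjects and items the values themselves mean little; the
sketch is intended only to show the order of the calculations and that
the split-half and Kuder and Richardson estimates need not agree.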


(b) Methods used in estimating the errors of measurement 


The reliability coefficient by itself does not give us an estimate of
what has been termed the absolute accuracy of our measurements,
although, as will be shown in part (c) of this section, it does give us
an indirect estimate of their relative accuracy. We may, however,
use the reliability coefficient in obtaining our estimate of the absolute
accuracy, i.e. in calculating an estimate of what is termed the standard
error of measurement of an individual score. This estimate, however,
may be calculated directly from the distribution of differences between
the scores obtained on the two testings. Assuming that these
differences, except for the constant practice effect, are caused solely by
errors of measurement, we may use $1/\sqrt{2}$ times the standard deviation
of the distribution of differences as a measure of the errors, and,
therefore, speak of the “standard error” of measurement. The other method
is to use the standard deviation of the obtained test scores times
$\sqrt{1-r}$ as the estimate, i.e.

$$ S_E = S\sqrt{1 - r} \qquad (4) $$

where $S_E$ denotes the standard error of measurement; $S$ the standard
deviation of the distribution of scores on the test, and $r$ the reliability
coefficient. If the standard deviations of the distributions of the two
sets of scores are equal, then these two values of $S_E$ are identical.
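
The agreement between the direct and the indirect estimate of the
standard error of measurement can be verified numerically. In the
sketch below the two sets of scores are hypothetical and have been
chosen so that their standard deviations are equal; the function names
are our own. The standard deviations use the divisor N - 1 (the
behaviour of `statistics.stdev`), but the agreement holds for either
divisor provided the same one is used throughout.

```python
import math
from statistics import stdev


def correlation(x, y):
    """Product-moment correlation between two sets of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def sem_from_differences(x, y):
    """Direct estimate: 1/sqrt(2) times the standard deviation of the differences."""
    differences = [b - a for a, b in zip(x, y)]
    return stdev(differences) / math.sqrt(2.0)


def sem_from_reliability(scores, r):
    """Indirect estimate: S * sqrt(1 - r)."""
    return stdev(scores) * math.sqrt(1.0 - r)


# Hypothetical scores on two testings; the second set has the same
# standard deviation as the first and is higher by 1 point on the average.
first = [1, 2, 3, 4, 5]
second = [3, 2, 4, 6, 5]

r = correlation(first, second)
direct = sem_from_differences(first, second)
indirect = sem_from_reliability(first, r)
print(f"r = {r:.3f}  direct = {direct:.4f}  indirect = {indirect:.4f}")
```

The constant difference between the two testings does not enter either
estimate, since the standard deviation of the differences is taken
about their mean.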

This assumption of equal standard deviations is fundamental to 
all our work on the determination of the reliability or accuracy of our 
test. If we are using two alternative forms of a test, for example, and
find that the standard deviations are different, we cannot estimate 
either the reliability or the errors of measurement. We must conclude, 
in such a case, that the two forms of the test are not equivalent and 
refrain from using this particular experimental method. It may be 
argued that the inequality of the standard deviations will not affect 
our correlation coefficient. This may be true but it does not affect 
the issue at stake. If the forms are not equivalent, as they cannot be 
if the standard deviations are significantly different, then they must 
differ in some respect. Hence we cannot use this particular experi- 
mental method, as the basic assumption underlying this method is 
that the two forms are equivalent. It does not matter so much here 
if one of the forms is easier or more difficult than the other, provided 
the standard deviations are the same, as this constant difference will 
affect none of the results except the norms for the test. 

There is one danger connected with the use of this indirect method
of determining the standard error of measurement, $S_E$. The formula
given in equation (4) may be used to determine $S_E$ if, and only if,
the $S$ and $r$ on the right-hand side refer
to the same group of cases. We generally find that $S_E$ is constant for
most groups and, if this is so, then $r$ and $S$ must be related and hence
it may be incorrect to use in the same formula an $S$ and an $r$ which
refer to different groups. Occasionally we find that $S_E$ itself may
change from group to group (as in groups of pupils chosen from
different school grades, for example), and we conclude that the test does
not measure with equal accuracy at all levels. In such cases it is, of
course, particularly important to exercise care in the use of the formula
given in equation (4). Since the method discussed in the next part
uses a direct rather than an indirect approach to this problem, it is
suggested that it is probably the better one to use.
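
The danger of combining values from different groups may be seen from
a short calculation. The sketch assumes, as in the text, that the
standard error of measurement is the same in both groups; the value of
5 points and the two standard deviations are hypothetical, and the
function names are our own.

```python
import math


def reliability_for_group(sem, group_sd):
    """Coefficient implied by equation (4): r = 1 - (sem / S)^2."""
    return 1.0 - (sem / group_sd) ** 2


def sem_from_formula(group_sd, r):
    """Equation (4): standard error of measurement = S * sqrt(1 - r)."""
    return group_sd * math.sqrt(1.0 - r)


SEM = 5.0          # assumed constant error of measurement (hypothetical)
SD_NARROW = 10.0   # e.g. a single class with a narrow range of ability
SD_WIDE = 20.0     # e.g. several grades combined

r_narrow = reliability_for_group(SEM, SD_NARROW)  # 0.75
r_wide = reliability_for_group(SEM, SD_WIDE)      # 0.9375

# S and r from the same group recover the assumed error of 5 points.
matched = sem_from_formula(SD_NARROW, r_narrow)

# S from the narrow group with r from the wide group gives 2.5 points,
# i.e. the test appears twice as accurate as it actually is.
mismatched = sem_from_formula(SD_NARROW, r_wide)
```

The same error of measurement thus yields quite different reliability
coefficients in groups of different range, which is why the two values
entered in equation (4) must refer to the same group of cases.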


(c) The method suggested for use in estimating the sensitivity of a test

The method suggested for use is the one known as the Analysis of 
Variance. As correlation is not used, it will seem strange to many 
workers in education and psychology, but it is being more and more 
widely used in these fields. As the general use of this method has been 
discussed elsewhere [53] only its application to the problem of relia- 
bility will be considered here. The theory underlying the method will 
not be discussed in any detail as it is felt that research workers will be 
interested in the practical rather than the theoretical aspects of the 
problem. 

The idea underlying the method is very simple. It is assumed 
that the score made by an individual on a test may be considered as a 
sum of independent components, and the analysis is designed to give 
a measure of the.influence of each of these. In the problem of relia- 
bility the factors are few in number and the results obtained in the 
analysis are easily interpreted. Asit is easier to work with a particular 
situation in mind, let us start with some experimental results. The 
data given in the second and third columns of Table I refer to the 
scores made by a small class of 29 pupils on two forms of an intelligence 
test. What factors or components may be important? In the first 
place, it is clear that not all the individuals in the class áre of the same 
ability, so we must find out how well our test measures the differences 
between the pupils, i.e. distinguishes between the individuals. Sec- 
ondly, it will be seen that the pupils make, on the average, higher 
scores on Form B than on Form A; Form A was given to the children 
first, so this constant difference is called a measure of the "practice" 
effect. Finally, it will be seen that even after allowing for the influence 
of this practice effect, the scores on the two forms differ considerably. 
These residual differences we assume to be due to the errors of measure- 
ment by means of the test used, or, rather, we define the errors of 
measurement in this way. There is no other factor, except possibly 
fluctuations in the ability of the individuals and differences between 
the two forms, which we cannot isolate in such a simple experiment, so 
we class all these residual differences as error.

The next problem is to determine how to measure the effect of 
these various factors or components. This is done by obtaining a
measure of the amount each of these contributes to the total variance
(variance is a technical term used to denote the square of the standard 
deviation) of the scores. We break up the total variance, or rather the 
sum of squares of the deviations about the mean from which an esti- 
mate of the variance is calculated, into components which we may 
assign to these different factors. This gives us a means whereby we 
may determine the importance of the influence of each of these factors, 
and hence draw conclusions concerning the usefulness of our test as a 
measuring instrument. 
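
The breakdown described above can be written as a single identity. The following is a sketch in notation that is assumed here for illustration and is not the authors' own: $X_i$ and $Y_i$ are the scores of the $i$-th of $n$ individuals on the two forms, $S_i = X_i + Y_i$, $D_i = X_i - Y_i$, and $M = \sum S_i / 2n$ is the mean of all $2n$ scores.

$$
\underbrace{\sum_i (X_i - M)^2 + \sum_i (Y_i - M)^2}_{\text{Total}}
= \underbrace{\frac{\left(\sum_i D_i\right)^2}{2n}}_{\text{Practice Effect}}
+ \underbrace{\frac{1}{2}\left[\sum_i S_i^2 - \frac{\left(\sum_i S_i\right)^2}{n}\right]}_{\text{Between Individuals}}
+ \underbrace{\frac{1}{2}\left[\sum_i D_i^2 - \frac{\left(\sum_i D_i\right)^2}{n}\right]}_{\text{Error}}
$$

The identity follows because $S_i^2 + D_i^2 = 2X_i^2 + 2Y_i^2$, so the right-hand side reduces to $\sum X_i^2 + \sum Y_i^2 - (\sum S_i)^2 / 2n$, which is the total sum of squares about the general mean.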

As far as the arithmetic of the analysis is concerned, this is quite
simple, probably even simpler than the arithmetic procedure
involved in the calculation of a correlation coefficient. We calculate for
each individual the sum of his scores, and the difference between his 
scores, on the two forms of the test as shown in the last two columns 
of Table I. Then we calculate the sum and sum of squares of the 
values in each column and write these in the spaces provided in the 
two bottom rows of the table. In the column headed X in the table, 
for example, the sum is simply the total of the 29 values in the column, 
and the sum of squares (in the bottom row) is the total of the squares 
of each of the 29 values. It will be seen that three checks on the
accuracy of our work may be made at this stage: the sum of X+Y
must be the same as the sum of X plus the sum of Y, i.e.

        1390 = 633 + 757

Similarly, for the differences,

        −124 = 633 − 757

and, for the sums of squares,

        78760 + 1684 = 2(16537 + 23685)

A final check is made at a later stage in the analysis.

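To calculate the sums of squares, from which the estimates of
variance attributable to each factor are obtained, we proceed as
follows, using the sums and sums of squares given in the two bottom
rows of Table I (n denotes the number of pupils):

(1) for Error

$$\frac{1}{2}\left[\sum (X-Y)^2 - \frac{\left(\sum (X-Y)\right)^2}{n}\right] = 521.667$$

(2) for Between Individuals

$$\frac{1}{2}\left[\sum (X+Y)^2 - \frac{\left(\sum (X+Y)\right)^2}{n}\right] = 6067.931$$

(3) for Practice Effect

$$\frac{\left(\sum (X-Y)\right)^2}{2n} = 320.333$$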
TABLE I

SCORES RECEIVED BY PUPILS ON FORMS A AND B OF AN INTELLIGENCE TEST

                      Score on          Sum of
Pupil                                   Scores    Difference
No.               Form A    Form B       X+Y         X−Y
                    X         Y

  1                  9        14          23         −5
  2                 15        22          37         −7
  3                  9        12          21         −3
  4                 10        19          29         −9
  5                 40        37          77          3
  6                 13         8          21          5
  7                 19        20          39         −1
  8                 17        34          51        −17
  9                 18        19          37         −1
 10                 15        20          35         −5
 11                 24        29          53         −5
 12                 24        24          48          0
 13                 13        28          41        −15
 14                 29        30          59         −1
 15                 13        16          29         −3
 16                 23        26          49         −3
 17                 19        28          47         −9
 18                 24        15          39          9
 19                 16        16          32          0
 20                 41        46          87         −5
 21                 35        30          65          5
 22                 24        30          54         −6
 23                 33        53          86        −20
 24                 24        27          51         −3
 25                 32        41          73         −9
 26                 20        24          44         −4
 27                 45        56         101        −11
 28                 12        11          23          1
 29                 17        22          39         −5

Sum                633       757        1390       −124
Sum of Squares   16537     23685       78760       1684

(4) for Total

$$16537 + 23685 - \frac{(1390)^2}{58} = 6{,}909.931$$


It is customary, and also convenient, to put all these values in a table 
of the form shown in Table II. 
TABLE II

ANALYSIS OF VARIANCE OF SCORES MADE BY PUPILS ON TWO FORMS
OF AN INTELLIGENCE TEST

                              Degrees
Variance                        of         Sum of        Mean
                              Freedom      Squares       Square

Due to Practice Effect           1          320.333      320.333
Between Individuals             28        6,067.931      216.712
Error                           28          521.667       18.631
Total                           57        6,909.931      ......


As the total of the sum of squares for (1), (2) and (3) must be iden- 
tically equal to (4), this gives us a final check on the accuracy of the 
calculations. 
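
The arithmetic described above can be sketched in a few lines of Python. This is an illustration only: the function name `two_form_anova` and the variable names are assumptions made here, not part of the original text, and the scores are those of Table I. Run on these scores, the sketch reproduces the Between Individuals and Total sums of squares of Table II (6,067.931 and 6,909.931). For Practice Effect and Error it gives about 265.103 and 576.897 rather than the printed 320.333 and 521.667, although both pairs add to the same 842.000, so the final check described above is satisfied either way.

```python
# Form A (X) and Form B (Y) scores for the 29 pupils of Table I.
SCORES = [
    (9, 14), (15, 22), (9, 12), (10, 19), (40, 37), (13, 8), (19, 20),
    (17, 34), (18, 19), (15, 20), (24, 29), (24, 24), (13, 28), (29, 30),
    (13, 16), (23, 26), (19, 28), (24, 15), (16, 16), (41, 46), (35, 30),
    (24, 30), (33, 53), (24, 27), (32, 41), (20, 24), (45, 56), (12, 11),
    (17, 22),
]


def two_form_anova(scores):
    """Split the total sum of squares of paired scores into three parts."""
    n = len(scores)
    sums = [x + y for x, y in scores]    # the X+Y column
    diffs = [x - y for x, y in scores]   # the X-Y column
    total_s, total_d = sum(sums), sum(diffs)
    ss = {
        "practice": total_d ** 2 / (2 * n),
        "individuals": 0.5 * (sum(s * s for s in sums) - total_s ** 2 / n),
        "error": 0.5 * (sum(d * d for d in diffs) - total_d ** 2 / n),
        "total": sum(x * x + y * y for x, y in scores) - total_s ** 2 / (2 * n),
    }
    df = {"practice": 1, "individuals": n - 1, "error": n - 1,
          "total": 2 * n - 1}
    return ss, df


ss, df = two_form_anova(SCORES)
for name in ("practice", "individuals", "error", "total"):
    mean_square = ss[name] / df[name]
    print(f"{name:12s} df={df[name]:3d}  SS={ss[name]:10.3f}  MS={mean_square:9.3f}")
```

Working from the sum and difference columns, as the text does, keeps the computation to one pass over the data and makes the additive check on the sums of squares immediate.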

The first column of Table II is self-explanatory; we simply list the
factors in which we are interested, and the quantities in the third
column have been explained above. The entries in column 2, headed
"Degrees of Freedom", will require some explanation. These quantities
are used as divisors in calculating the values shown in the last
column (headed "Mean Square"); in the Between Individuals row, for
example,

        216.712 = 6067.931 / 28,

and the other mean square values are calculated in a similar manner.
Generally, in calculating the value of the square of the standard
deviation, we divide the sum of squares by the number of observations
in the sample. In small samples, however, this estimate is biased and
the bias may be compensated by dividing by the number of degrees of
freedom instead of the number of observations. In examples of the
kind considered here, the numbers of degrees of freedom to be used
are as follows:

        (1) Due to Practice Effect        1
        (2) Between Individuals           n−1
        (3) Error                         n−1
        (4) Total                         2n−1

where n denotes the number of individuals tested. It will be noticed
that the additive property discussed above in connection with the sums
of squares applies also to the degrees of freedom, i.e.

        1 + (n−1) + (n−1) = 2n−1
For the benefit of those who are interested in the question of the
number of degrees of freedom associated with any particular sum of
squares, it may be pointed out that this number is always equal to the
number of independent deviations which are used in the calculation
of the associated sum of squares. For the error row, for example, we
see from Table I that while there are 29 differences (one for each pupil),
yet in calculating the sum of squares for error we subtract the mean
difference (−124/29) from each. We have, therefore, only 28 of these
difference deviations which are independent, and hence the number of
degrees of freedom is 28.

With regard to the use which is to be made of the results shown in
Table II, we see that in the first place we may obtain an estimate of
the standard error of measurement of an individual score, denoted by
$S_E$ in equation (4), by taking the square root of the error mean square.
We find $S_E = \sqrt{18.631} = 4.32$ score units. This gives us directly an
estimate of the absolute accuracy of our measurements.

The next problem in which we are interested is to determine whe- 
ther or not the practice effect is significant, i.e. significantly different 
from zero. If the practice effect is zero, then the corresponding mean 
square in the table will be of the same order of magnitude as the error 
mean square; if the practice effect is significant, its mean square will 
be larger than the error mean square. It follows, therefore, that if we 
can show the practice effect mean square is significantly greater than 
the error mean square, then we may conclude that the practice effect 
is significant. In making this test, we proceed as follows: 

(1) calculate

$$F = \frac{320.333}{18.631} = 17.19;$$

(2) refer to Snedecor's table [53] of F with degrees of freedom
$n_1 = 1$ and $n_2 = 28$;

(3) conclude that the two mean squares are of the same order
of magnitude if the calculated value of F is less than the 5%
(or 1%) point of the distribution of F given in the table, or
conclude that the two mean squares differ significantly if
the calculated value of F is greater than the 5% (or 1%)
point of the distribution of F given in the table.¹

¹ The question of whether to use the 5% or the 1% point of the distribution
of F as a critical value is a personal one, and is sometimes determined by the
nature of the problem under consideration. It is customary, however, to conclude
that the mean squares are different if the calculated value of F is greater than the
1% point, to refrain from drawing any conclusion if it lies between the 5% and the
1% points, and to conclude that they are of the same order of magnitude if it is less
than the 5% point. This is, however, a purely arbitrary choice of a critical value.

In our particular case we find that the 5% and 1% points of the
distribution of F are 4.20 and 7.64, respectively. As the value of F
(F = 17.19) which we obtained is considerably greater than the 1% point
we conclude that the practice effect mean square is greater than the
error mean square, and hence that the practice effect is significant.

The next problem is to determine whether or not the test measures
with an accuracy sufficient to enable us to distinguish between the
individuals tested. If the accuracy with which the test measures is
not sufficient for this purpose, then the differences between the scores
obtained by the individuals on the test will be very small and due
solely to the errors of measurement; in this case the between
individuals mean square will be of the same order of magnitude as the
error mean square. If, on the other hand, the accuracy with which
the test measures is sufficient for this purpose, then the differences
between the scores will be larger than could be explained solely on the
basis of the errors of measurement, and the between individuals mean
square will be significantly larger than the error mean square. It
follows, therefore, that if we can show the between individuals mean
square is significantly larger than the error mean square, then we may
conclude that the test measures with an accuracy sufficient to enable
us to distinguish between the individuals tested. The procedure
followed in making this test is similar to that considered in the
previous case:

(1) calculate

$$F = \frac{216.712}{18.631} = 11.63;$$

(2) refer to Snedecor's table of F with degrees of freedom
$n_1 = n_2 = 28$;

(3) conclude that the two mean squares are of the same order
of magnitude if the calculated value of F is less than the 5%
(or 1%) point of the distribution of F given in the table, or
conclude that the two mean squares differ significantly if
the calculated value of F is greater than the 5% (or 1%)
point of the distribution of F given in the table. As the value
we found (F = 11.63) is larger than the 1% critical value, we
conclude that the two mean squares differ significantly and
hence that the accuracy with which our test measures is
sufficient to enable us to distinguish between the individuals
tested.
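
The two significance tests can be sketched as a small Python function that applies the customary rule described in the footnote on critical values. The function name `compare_mean_squares` and its verdict strings are assumptions made here for illustration. The critical values 4.20 and 7.64 are those quoted in the text for 1 and 28 degrees of freedom, and 2.50 is the 1% point the text uses for 28 and 28 degrees of freedom; the 5% point of 1.88 for 28 and 28 degrees of freedom is not quoted in the text and is an approximate value taken from standard F tables.

```python
def compare_mean_squares(ms_effect, ms_error, f_5pct, f_1pct):
    """Return the F ratio and a verdict under the customary three-way rule."""
    f = ms_effect / ms_error
    if f > f_1pct:
        verdict = "mean squares differ"
    elif f < f_5pct:
        verdict = "same order of magnitude"
    else:
        verdict = "no conclusion"   # F lies between the 5% and 1% points
    return f, verdict


# Practice effect: mean squares from Table II, critical values from the text.
practice = compare_mean_squares(320.333, 18.631, f_5pct=4.20, f_1pct=7.64)

# Between individuals: the 5% point (1.88) is an assumed table value.
individuals = compare_mean_squares(216.712, 18.631, f_5pct=1.88, f_1pct=2.50)

print(f"practice effect:     F = {practice[0]:.2f}, {practice[1]}")
print(f"between individuals: F = {individuals[0]:.2f}, {individuals[1]}")
```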
This leads us to the problem of determining the relative accuracy
of our measurements, i.e. the relation between the size of the errors of
measurement and the size of the differences between the individuals
tested. Jackson [52] has suggested that the simplest measure of the
relative accuracy is what he has termed the sensitivity of the test.
This is defined as

$$\gamma = \frac{\sigma_c}{\sigma_e} \qquad (5)$$

where $\gamma$ denotes the sensitivity of the test; $\sigma_c$, the standard deviation
of the distribution of ability in the population from which we are
sampling, and $\sigma_e$ is the standard deviation of the distribution of the
errors of measurement. It will be seen that this measure is particularly
easy to interpret; if $\gamma$ is small, then the errors of measurement
will be large in comparison with the differences between the abilities
of the individuals tested, and the score obtained by an individual on
the test may be determined largely by these random errors of
measurement. For a particular value of $\gamma$, we can determine the probability,
say $\eta$, from the tables of the normal integral of making an error greater
than or equal to $\sigma_c$ units due to chance alone in using the score as an
estimate of the ability of an individual. Certain values of $\gamma$ and $\eta$ are
given in Table III. It will be seen that even for fairly large values of $\gamma$,
the random errors of measurement may be very important in
determining the actual score of an individual on the test.


TABLE III

VALUES OF γ AND η

γ      0.5     1.0     1.5     2.0     2.5     3.0

η      .62     .32     .13     .046    .012    .003
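
The tabled probabilities can be reproduced from the normal integral. The sketch below assumes that the errors of measurement are normally distributed with standard deviation σ_e and that an error of at least σ_c in either direction is meant; the two-sided reading is an inference made here because it matches the tabled values. An error of σ_c is γ standard deviations of error, so the probability is that of a normal deviate exceeding γ in absolute value. The function name `chance_error_probability` is hypothetical.

```python
import math


def chance_error_probability(gamma):
    """P(|Z| >= gamma) for a standard normal deviate Z (two-sided tail)."""
    return math.erfc(gamma / math.sqrt(2))


for gamma in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    print(f"gamma = {gamma:3.1f}   eta = {chance_error_probability(gamma):.3f}")
```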


In the population from which we are sampling, the sensitivity and
the reliability coefficient are related, i.e.

$$\gamma = \sqrt{\frac{\rho}{1-\rho}} \qquad (6)$$

where $\rho$ denotes the population reliability coefficient. The reliability
coefficient does, therefore, give us an indirect estimate of the relative
accuracy of measurements. In estimating $\gamma$ from the sample values,
however, it is better to proceed directly rather than to attempt to use
the reliability coefficient. The estimate of $\gamma$ discussed in the next
paragraph is the best estimate which can be obtained from the sample
values.
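
The relation between sensitivity and reliability can be sketched as a pair of conversion functions. The sketch assumes the usual definition of the reliability coefficient as the ratio of the variance of ability to the total variance, ρ = σ_c² / (σ_c² + σ_e²), which with γ = σ_c / σ_e gives γ² = ρ / (1 − ρ). The function names are hypothetical.

```python
import math


def sensitivity_from_reliability(rho):
    """gamma = sqrt(rho / (1 - rho)), defined for 0 <= rho < 1."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    return math.sqrt(rho / (1.0 - rho))


def reliability_from_sensitivity(gamma):
    """Inverse relation: rho = gamma^2 / (1 + gamma^2)."""
    return gamma ** 2 / (1.0 + gamma ** 2)


# A reliability coefficient of 0.84, the value quoted later in this
# section for the data of Table I, corresponds to a sensitivity near 2.29.
print(round(sensitivity_from_reliability(0.84), 2))
```

The steepness of the relation near ρ = 1 is worth noting: raising ρ from 0.84 to 0.90 moves γ only from about 2.3 to 3.0, which is one reason a high coefficient can coexist with sizeable errors in individual scores.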


In estimating $\gamma$, we may use either a unique estimate or, better
still, the "confidence interval" which was discussed earlier in this
chapter. To obtain the unique estimate, we proceed as follows:

(1) subtract the error mean square from the between individuals
mean square;

(2) divide the difference by twice the error mean square;

(3) use the square root of the quotient as an estimate of $\gamma$.

For our example, from the values given in Table II, we have

(1) 216.712 − 18.631 = 198.081

(2) 198.081 / 37.262 = 5.316

(3) est. $\gamma = \sqrt{5.316} = 2.31$.

To find the confidence interval we proceed as follows:

(1) calculate the ratio of the between individuals to the error
mean square, which we may denote by F;

(2) from Snedecor's table of F find the 5%, or 1%, point of the
distribution of F, which may be denoted by $F_{5\%}$ or $F_{1\%}$;

(3) to find the lower limit of the interval, say $\underline{\gamma}$, using $F_{1\%}$, for
example, calculate

$$\underline{\gamma} = \sqrt{\frac{1}{2}\left(\frac{F}{F_{1\%}} - 1\right)} \qquad (7)$$

(4) to find the upper limit of the interval, say $\bar{\gamma}$, using $F_{1\%}$, for
example, calculate

$$\bar{\gamma} = \sqrt{\frac{1}{2}\left(F \cdot F_{1\%} - 1\right)} \qquad (8)$$

(5) we may make the statement

$$\underline{\gamma} \le \gamma \le \bar{\gamma} \qquad (9)$$

and we know that the probability of this statement being correct is .98
(it would be .90 if we had used $F_{5\%}$). For our example, we have

(1) $F = \dfrac{216.712}{18.631} = 11.63$

(2) $F_{1\%} = 2.50$

(3) $\underline{\gamma} = \sqrt{\dfrac{1}{2}\left(\dfrac{11.63}{2.50} - 1\right)} = 1.35$

(4) $\bar{\gamma} = \sqrt{\dfrac{1}{2}\left(11.63 \times 2.50 - 1\right)} = 3.75$

(5) $1.35 < \gamma < 3.75$.


From a theoretical point of view, the confidence interval method is 
the better one to use, but in practice the unique estimate is probably 
more convenient. In using the latter estimate, however, one must 
always remember that it is subject to error. 
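
Both estimates can be sketched in Python from the two mean squares of Table II. The function names are hypothetical. The interval formulas assume, as in the example, that the two mean squares have the same number of degrees of freedom, so that one tabled point of F serves for both limits; clamping a negative quantity to zero before taking the square root is a choice made here for the sketch and is not taken from the text.

```python
import math


def sensitivity_estimate(ms_between, ms_error):
    """Unique estimate: sqrt((MS_between - MS_error) / (2 * MS_error))."""
    return math.sqrt(max(ms_between - ms_error, 0.0) / (2.0 * ms_error))


def sensitivity_interval(ms_between, ms_error, f_point):
    """Limits for gamma from the observed F ratio and a tabled F point."""
    f = ms_between / ms_error
    lower = math.sqrt(max(f / f_point - 1.0, 0.0) / 2.0)
    upper = math.sqrt(max(f * f_point - 1.0, 0.0) / 2.0)
    return lower, upper


estimate = sensitivity_estimate(216.712, 18.631)
lower, upper = sensitivity_interval(216.712, 18.631, f_point=2.50)
print(f"estimate {estimate:.2f}, interval {lower:.2f} to {upper:.2f}")
```

With the 1% point of 2.50 used in the text, the limits agree with the worked example.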

In some problems where a particular degree of sensitivity is re- 
quired, it is necessary to determine whether or not the test considered 
reaches this standard. This problem is discussed in detail in Appendix 
B; the solution given is simple and involves merely an extension of the 
method used earlier in this section. The statistical problem proposed 
and solved is that of developing a test of the hypothesis $\gamma = K$, where K
is some value fixed in advance. The usefulness of this statistical test 
is obvious. 

If we use the correlation method in analysing the above experi- 
mental results, we find a reliability coefficient of r=0.84. The diffi- 
culty of interpreting this result will be admitted by all. It is suggested, 
therefore, that it is better to use the analysis of variance method in 
analysing the results of experiments of this kind. 

It was mentioned earlier that our results will be different if we use 
different groups in our experimental work, i.e. if we sample from a 
different population. Let us consider, for example, the effect on the 
results if we sample from four grades instead of just one. Figure 1 
shows the distribution of scores made by pupils on two forms of an 
intelligence test: as different symbols are used to represent the scores 
which refer to different grades, this diagram shows the effect of using 
the broader unit in sampling. It will be seen that the values are spread 
along a line, like beads on a wire. The more grades we include the 
more important does this “elongation” effect become. It is obvious 
that the reliability coefficient will be increased, and in some cases 
greatly increased, by this effect. If we use the analysis of variance 
and covariance method in analysing our results, we obtain a measure 
of the influence of this effect, and also a measure of the reliability freed 
from the influence of this effect. It is clear that in this case the factor 
causing the “elongation” is the differences between the grades, so this 
is the factor which we must measure and eliminate. 

In order to make this analysis we need, for each form and for each 
grade, the sum, sum of squares and finally the sum of products of the 
scores: the data relating to this example are given in Table IV. 
[FIGURE 1. Distribution of Scores Made by Pupils on Two Forms of an
Intelligence Test. Axis label: Score on Second Test.]
TABLE IV

DATA RELATING TO THE SCORES OF PUPILS ON TWO FORMS OF AN
INTELLIGENCE TEST

                    Sum of     Sum of     Sum of       Sum of       Sum of
         Number     Scores     Scores     Squares      Squares      Products
Grade      of         on         on       of Scores    of Scores    of Scores
         Pupils     First      Second     on First     on Second    on First and
                    Form       Form       Form         Form         Second Form

3           98      2,094      2,552       53,408        77,042       62,518
4          112      3,862      4,408      150,250       193,600      168,958
5          136      6,134      7,143      295,016       394,941      338,963
6          108      5,846      6,442      329,678       397,728      360,596

Total      454     17,936     20,545      828,352     1,063,311      931,035


As we want the sums of squares and products of the deviations from
the means, we must again calculate these separately for each grade
and form, and also for the total. For Grade 3, for example, we have:

(1) Sum of Squares of Deviations for First Form

$$53{,}408 - \frac{(2{,}094)^2}{98} = 53{,}408 - 44{,}743.224 = 8{,}664.776$$

(2) Sum of Squares of Deviations for Second Form

$$77{,}042 - \frac{(2{,}552)^2}{98} = 77{,}042 - 66{,}456.163 = 10{,}585.837$$

(3) Sum of Products of Deviations

$$62{,}518 - \frac{(2{,}094)(2{,}552)}{98} = 62{,}518 - 54{,}529.469 = 7{,}988.531.$$
It is convenient in an analysis of this kind to show all these values as 
in Table V. The values shown in the second row from the bottom of 
the table are the sums of the values in the preceding four rows. Some 
of these are used later in calculating a measure of the effect of the 
differences between grades, and the others (in the last three columns) 
give us the total sums of squares and products for within grades. 
TABLE V

QUANTITIES REQUIRED IN THE CALCULATION OF THE BETWEEN GRADES
AND WITHIN GRADES SUMS OF SQUARES AND PRODUCTS

          Correction factors: Squares and            Sum of Squares and Products
          Products of Sums of Scores                 of Deviations about Means
          divided by Number of Pupils

                                                     Sum of        Sum of
Grade     First         Second                       Squares for   Squares for   Sum of
          Form          Form          Products       First Form    Second Form   Products

3          44,743.224    66,456.163    54,529.469      8,664.776    10,585.837     7,988.531
4         133,170.036   173,486.286   151,997.286     17,079.964    20,113.714    16,960.714
5         276,661.441   375,165.066   322,170.309     18,354.559    19,775.934    16,792.691
6         316,441.815   384,253.370   348,703.074     13,236.185    13,474.630    11,892.926

Sum       771,016.516   999,360.885   877,400.138     57,335.484    63,950.115    53,634.862

Total
for all
grades    708,590.520   929,729.130   811,663.260    119,761.480   133,581.870   119,371.740


Using these results, we may present our final analysis of variance
and covariance in the form shown in Table VI.

TABLE VI

ANALYSIS OF VARIANCE AND COVARIANCE OF SCORES MADE BY PUPILS ON TWO FORMS
OF AN INTELLIGENCE TEST

                     Degrees
Variance               of       Sum of Squares   Sum of Squares   Sum of
                     Freedom    First Form       Second Form      Products

Between Grades           3        62,425.996       69,631.755      65,736.878
Within Grades          450        57,335.484       63,950.115      53,634.862
Total                  453       119,761.480      133,581.870     119,371.740


The sums of the squares and products are obtained as follows: for
within grades, we use the last three values given in the second row from
the bottom of Table V; for total, we use the last three values given in
the last row of Table V; for between grades, using the values shown
in the last two rows and columns 2, 3 and 4 of Table V, we find

(1) for the First Form
        771,016.516 − 708,590.520 = 62,425.996
(2) for the Second Form
        999,360.885 − 929,729.130 = 69,631.755
(3) for the Products
        877,400.138 − 811,663.260 = 65,736.878.


As a check on the accuracy of our calculations, we note that in each
column of Table VI the sum of the values for between and within
grades is identically equal to the total given in the last row. The
degrees of freedom shown in Table VI are found as follows: since there
are 4 grades, the number of degrees of freedom for between grades
will be 4 − 1 = 3; the number of degrees of freedom for within grades
may be obtained from the second column of Table IV:
(98 − 1) + (112 − 1) + (136 − 1) + (108 − 1) = 450; the number of degrees
of freedom for the total is, from Table IV: 454 − 1 = 453.

From the values given in Table VI, we may calculate estimates of
three reliability coefficients:

(1) for between grades

$$r = \frac{65{,}736.878}{\sqrt{(62{,}425.996)(69{,}631.755)}} = 0.997$$

(2) for within grades

$$r = \frac{53{,}634.862}{\sqrt{(57{,}335.484)(63{,}950.115)}} = 0.886$$

(3) for total (i.e. all grades)

$$r = \frac{119{,}371.740}{\sqrt{(119{,}761.480)(133{,}581.870)}} = 0.944.$$


The difference between these last two estimates gives us a measure of 
the effect of using the larger unit in sampling, i.e. sampling from 4 
grades instead of just one. The estimate calculated from the totals 
for all grades is increased by the inclusion of the very significant differ- 
ences between grades, i.e. the “elongation” effect shown graphically 
in Figure 1. The first estimate, r = 0.997, refers to the means for the
grades and not to the individual scores; it is, as one would expect,
considerably higher than the within grades estimate. When we use
the estimate for the total, r = 0.944, both the between grades and
within grades are included; we have, therefore, neither the one nor the
other but a kind of compound of the two.
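
The three coefficients can be recomputed from the grade totals of Table IV. The following Python sketch is an illustration only; the names `GRADES`, `deviations` and `reliability` are assumptions made here. It forms the within grades sums of squares and products by adding the four grades, the total from the pooled sums, and the between grades values by subtraction, as in Tables V and VI.

```python
import math

# Per grade, from Table IV:
# (pupils, sum X, sum Y, sum X^2, sum Y^2, sum XY), X = first form, Y = second.
GRADES = {
    3: (98, 2094, 2552, 53408, 77042, 62518),
    4: (112, 3862, 4408, 150250, 193600, 168958),
    5: (136, 6134, 7143, 295016, 394941, 338963),
    6: (108, 5846, 6442, 329678, 397728, 360596),
}


def deviations(n, sx, sy, sxx, syy, sxy):
    """Sums of squares and products of deviations about the means."""
    return (sxx - sx * sx / n, syy - sy * sy / n, sxy - sx * sy / n)


def reliability(ss_x, ss_y, sp_xy):
    """Correlation coefficient from sums of squares and products."""
    return sp_xy / math.sqrt(ss_x * ss_y)


# Within grades: add the deviation sums of the four grades.
within = tuple(sum(col) for col in zip(*(deviations(*g) for g in GRADES.values())))
# Total: treat all 454 pupils as one group.
total = deviations(*(sum(col) for col in zip(*GRADES.values())))
# Between grades: total minus within.
between = tuple(t - w for t, w in zip(total, within))

for label, parts in (("between", between), ("within", within), ("total", total)):
    print(f"{label:8s} r = {reliability(*parts):.3f}")
```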

This raises the question :—which estimate should be used? Unfor- 
tunately, this cannot be answered. The question as to which estimate 
is appropriate will be determined by the conditions of the problem on 
which we are working. It is probably safer to give them all, so that 
other research workers will have no difficulty in interpreting the results. 
It is clear, however, that our results may be misleading or meaningless 
unless we state clearly the nature of the population from which we 
have sampled and to which the values refer. 

The above suggested analysis refers to the estimates of the relia- 
bility coefficients. We may, however, extract considerably more 
information from the data if we use analyses of the type shown in 
Table II. We first analyse the results separately for each grade, as
shown in Table VII, and then compare the values in the different rows
in order to determine whether or not the results may be combined
[see 53, pp. 83-96]. We find that our estimates of the errors of
measurement do not differ significantly from grade to grade (see the "Error"
row of Table VII), so we conclude that the test measures with the same
absolute accuracy at all levels. The best estimate of the standard
error of measurement, $S_E$, may be found from the values given in this
error row of Table VII as shown below:

$$S_E = \sqrt{\frac{(1{,}636.77) + (1{,}636.13) + (2{,}272.55) + (1{,}462.48)}{97 + 111 + 135 + 107}} = \sqrt{\frac{7{,}007.93}{450}} = 3.95 \text{ score units.}$$


When we consider the question of the relative accuracy of the 
measurements, i.e. compare the relative accuracy with which the test 
measures in different grades, we find significant differences. The test
distinguishes between the individuals best in Grade 4, the efficiency is
slightly lower in Grades 5 and 6, and poorest in Grade 3. The test
seems to be too hard for Grade 3 and hence does not distinguish 
between the individuals so well. It is clear that the test is better 
suited for testing children in Grades 4, 5 and 6. 

With regard to the practice effect, we find a [...]
between the grades are significant, mainly a [...]
effect occurring in Grade 5. No explanation [...]
however, so we can only conclude that, for so[...]
practice effect on this test is significantly greater in Grade 5 than in
the other grades.

In cases such as this, where significant differences between grades 
or other groups are found, it is clear that it is wrong to combine the 
results and give only one analysis. The differences are often more 
useful than the similarities in determining the usefulness or appro- 
priateness of a test. It is suggested, therefore, that in all cases an 
analysis similar to that shown in Table VII should be made. If signi- 
ficant differences between the grades or groups are found, the results 
should be shown separately for each group and not combined into a 
single analysis. If, in spite of these differences, a combined analysis 
is given, it should be clearly stated that the results apply only to the 
total of the groups and not necessarily to the component groups which 
form the total. The question of whether or not a useful interpretation 
of the results of the total analysis is possible can be answered only by 
an examination of the nature of the problem under consideration. 
Note.

The values given in Table VII may be obtained by the method
discussed in the first part of this section, or from the values given in
Table IV. Using the results given in Table IV, we find for Grade 3,
for example:

(1) Sum of Squares corresponding to Practice Effect

$$\frac{(2{,}094)^2 + (2{,}552)^2}{98} - \frac{(2{,}094 + 2{,}552)^2}{196} = 111{,}199.39 - 110{,}129.16 = 1{,}070.23.$$

(2) Sum of Squares corresponding to Between Individuals

$$\frac{1}{2}\left[53{,}408 + 77{,}042 + 2(62{,}518) - \frac{(2{,}094 + 2{,}552)^2}{98}\right] = \frac{1}{2}\left[255{,}486 - 220{,}258.33\right] = 17{,}613.84.$$

(3) Sum of Squares corresponding to Error

$$\frac{1}{2}\left[53{,}408 + 77{,}042 - 2(62{,}518) - \frac{(2{,}094 - 2{,}552)^2}{98}\right] = \frac{1}{2}\left[5{,}414 - 2{,}140.45\right] = 1{,}636.77.$$

(4) Sum of Squares corresponding to Total

$$53{,}408 + 77{,}042 - \frac{(2{,}094 + 2{,}552)^2}{196} = 130{,}450 - 110{,}129.16 = 20{,}320.84.$$

The corresponding values for the other grades may be calculated
in a similar manner.
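
The four formulas of the Note translate directly into code that works from the grade totals of Table IV, without the individual scores. The sketch below is illustrative; `anova_from_sums` and `TABLE_IV` are hypothetical names. It also pools the error sums of squares over the four grades to give the combined estimate of the standard error of measurement. Because the sketch carries full precision, the Grade 3 values for practice effect and error come out a unit lower in the second decimal place than the hand-rounded figures above.

```python
import math

# (pupils, sum X, sum Y, sum X^2, sum Y^2, sum XY) for each grade, Table IV.
TABLE_IV = {
    3: (98, 2094, 2552, 53408, 77042, 62518),
    4: (112, 3862, 4408, 150250, 193600, 168958),
    5: (136, 6134, 7143, 295016, 394941, 338963),
    6: (108, 5846, 6442, 329678, 397728, 360596),
}


def anova_from_sums(n, sx, sy, sxx, syy, sxy):
    """Sums of squares for practice, individuals, error and total."""
    grand = (sx + sy) ** 2
    return {
        "practice": (sx ** 2 + sy ** 2) / n - grand / (2 * n),
        "individuals": 0.5 * (sxx + syy + 2 * sxy - grand / n),
        "error": 0.5 * (sxx + syy - 2 * sxy - (sx - sy) ** 2 / n),
        "total": sxx + syy - grand / (2 * n),
    }


by_grade = {grade: anova_from_sums(*row) for grade, row in TABLE_IV.items()}

# Pool the error sums of squares; each grade contributes n - 1 degrees of freedom.
pooled_ss = sum(a["error"] for a in by_grade.values())
pooled_df = sum(row[0] - 1 for row in TABLE_IV.values())
pooled_se = math.sqrt(pooled_ss / pooled_df)

for grade, a in by_grade.items():
    print(grade, {k: round(v, 2) for k, v in a.items()})
print("pooled standard error:", round(pooled_se, 2))
```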


(3) Comparison of the Accuracy of Physical and Mental Measurements 


Before we proceed to the next chapter, it is convenient at this stage 
to discuss the results of a little experiment which we carried out to 
compare the accuracy of physical and mental measurements. Since 
it was obviously more difficult to control an experiment involving 
mental measurements, it was decided to arrange an experiment with 
physical measurements which would correspond to the conditions 
generally found in measuring with mental tests. We had, therefore, 
to arrange a series of objects of different magnitudes and make two 
measurements of each. We could not, of course, arrange for the 
objects to change while being measured, or change between measure- 
ments, but otherwise the conditions seem to be comparable. 

We chose to measure the lengths of strips of cardboard; 100 strips 
of different lengths were used. These were arranged so that we had a 
normal distribution of lengths; the distribution is shown in Table VIII 
(a class-interval 3/4 of an inch in length is used in this table). 


TABLE VIII

DISTRIBUTION OF LENGTHS OF 100 STRIPS
OF CARDBOARD (UNITS OF 1/32 OF AN INCH)

Class Interval        Frequency

   91-114                  1
  115-138                  1
  139-162                  2
  163-186                  6
  187-210                  9
  211-234                 10
  235-258                 13
  259-282                 14
  283-306                 14
  307-330                 10
  331-354                  9
  355-378                  6
  379-402                  3
  403-426                  1
  427-450                  1

  Total                  100


This gave us a distribution of lengths for our physical measurements
comparable to the distribution of ability for mental measurements. 

The next step was to obtain two measurements of the length of 
each strip of cardboard in order that we might calculate a reliability 
coefficient comparable to those obtained in mental measurements. 
Somewhat to our surprise, considerable difficulty was experienced in 
reducing the accuracy of our measurements to the level found in the 
mental field. If we used an ordinary rigid measuring instrument, 
such as a ruler graduated in inches, the reliability coefficients were of 
the order of 0.99. It was obvious, therefore, that we had to use some 
non-rigid measuring instrument and deliberately introduce random 
errors of measurement. The plan which we finally adopted, after the 
trial and rejection of numerous others, is explained below. 

A strip of rubber approximately one-half an inch in width and 
16 inches long was cut from an inner tube of an automobile tire. This 
was stretched to twice its ordinary length and a scale marked on it 
(a unit on the scale at this tension corresponded approximately to 
one-eighth of an inch). These units were, of course, purely arbitrary, 
but this was immaterial. Finally, we fastened two clips firmly on the 
ends of this strip in order to make certain that the same length of 
rubber was used each time. The random errors of measurement were 
introduced by varying the tension each time the “rubber ruler” was 
used; as the clips were such that they could be slipped over the head 
of an ordinary nail, we could control the tension applied. We used
six different tensions; these are denoted, from the highest to the lowest, 
by the numbers 1, 2, 3, 4, 5 and 6 in the following discussion. 

In order to simplify the measuring process, and to control all
extraneous factors, we proceeded as follows: A board approximately
10 inches in width and 40 inches in length was procured, and in this
7 nails were driven in the positions indicated in Figure 2. By slipping
the clip at one end of the "rubber ruler" over the nail at 0, and the
clip at the other end over the appropriate nail at the right hand side
of the board, we could obtain any desired tension. By placing the
strip of cardboard to be measured in position, a measurement of its
length could be obtained with little difficulty. This procedure was
followed throughout the experiment. (It should be noted that the
scale as marked did not extend the full length of the "rubber ruler".)

The strips of cardboard were arranged in random order of length 
and numbered from 1 to 100. Finally, a random series of tensions was
chosen and two measurements of each strip of cardboard made accord- 
ing to this series. The figures given in Table IX show the number of 
each strip of cardboard, the tensions used in the measurements, and 
the estimates of the length of the strip obtained by the use of the
"rubber ruler". From the values given in the last two columns of the
table, we may calculate an estimate of the reliability coefficient, or
estimates of the absolute and relative accuracy of the measurements
by using the method known as the analysis of variance.

[FIGURE 2. Apparatus used in Measuring the Length of the Strips of Cardboard.]

Denoting by $X_i$ and $Y_i$ the values obtained in the first and second
measurement, respectively, of the i-th strip of cardboard, and
summation over all 100 values of i by $\Sigma$, we find:

        ΣX_i     =   6449
        ΣY_i     =   6465
        ΣX_i²    = 442539
        ΣY_i²    = 445847
        ΣX_iY_i  = 440620

Using these results we have r = 0.869 as the estimate of the reliability
coefficient. Our measurements, therefore, are as reliable as those found
in using a good intelligence test, or any other mental measuring
instrument.

Using the analysis of variance method, we find the results shown
in Table X. The difference between trials is not significant, but the
variance between the strips is significantly greater than the error
variance. We conclude, therefore, that our "rubber ruler" measures
with sufficient accuracy to enable us to distinguish between the lengths
of the strips of cardboard.


TABLE X

ANALYSIS OF VARIANCE OF THE MEASUREMENTS OF THE
LENGTHS OF 100 STRIPS OF CARDBOARD

                     Degrees of     Sum of       Mean
Variance             Freedom        Squares      Square

Between Trials            1            1.28        1.28
Between Strips           99       50,956.02      514.71
Error                    99        3,571.72       36.08
Total                   199       54,529.02      ......


Calculating an estimate of the sensitivity of our measuring instrument
by the method explained earlier in this chapter, we find $\gamma = 2.6$.

Finally, we may use the square root of the error mean square as an
estimate of the standard error of our measurements; in this case $S_E$ is
of the order of 6 units.
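The figures quoted above can be checked directly from the five sums. The following is a minimal sketch, not the authors' own computing scheme: the function names are illustrative, and the sensitivity line assumes the definition "standard deviation of true values divided by standard deviation of errors", which is not restated in this section.

```python
import math

# Sums reported in the text for the 100 strips of cardboard.
N = 100
sum_x, sum_y = 6449, 6465
sum_x2, sum_y2 = 442539, 445847
sum_xy = 440620


def pearson_r(n, sx, sy, sx2, sy2, sxy):
    """Product-moment correlation computed from raw sums."""
    sxx = sx2 - sx * sx / n
    syy = sy2 - sy * sy / n
    cross = sxy - sx * sy / n
    return cross / math.sqrt(sxx * syy)


def two_trial_anova(n, sx, sy, sx2, sy2, sxy):
    """Sums of squares for n objects each measured on two trials."""
    correction = (sx + sy) ** 2 / (2 * n)
    total = sx2 + sy2 - correction
    trials = (sx * sx + sy * sy) / n - correction
    objects = (sx2 + sy2 + 2 * sxy) / 2 - correction
    error = total - trials - objects
    return {"trials": trials, "strips": objects, "error": error, "total": total}


r = pearson_r(N, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
ss = two_trial_anova(N, sum_x, sum_y, sum_x2, sum_y2, sum_xy)
ms_strips = ss["strips"] / (N - 1)
ms_error = ss["error"] / (N - 1)
s_e = math.sqrt(ms_error)

# ASSUMPTION: sensitivity taken as sd(true values) / sd(errors), with the
# true-value variance estimated as (MS_strips - MS_error) / 2.
gamma = math.sqrt((ms_strips - ms_error) / (2 * ms_error))

print(round(r, 3), round(ms_strips, 2), round(ms_error, 2))
print(round(s_e, 1), round(gamma, 1))
```

Under these assumptions the sums of squares agree with Table X, and the standard error and sensitivity come out close to the values quoted in the text.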

If these results are compared with those given in Tables I and II
for the two forms of an intelligence test, it will be seen that, considered 
merely as a measuring instrument, our “rubber ruler” compares quite 
favourably with an intelligence test. It is realized, of course, that the 
conditions underlying the problems are not exactly the same, but it is 
felt that they are so nearly the same that the comparison is valid. 
This little experiment with a “rubber ruler", therefore, gives us a 
comparison of the accuracy of physical and mental measurements and, 
at the same time, gives us a clearer idea of the kind and magnitude of 
the errors which we make when we use a mental test as an instrument 
for measuring the ability of an individual, or of groups of individuals.
CHAPTER IV


EXPERIMENTAL RESULTS 


Experiment I. Estimation of Reliability Coefficients by the Split-half or
Odds-even Method. 

It has been found that the estimates of the reliability of a test 
obtained by the use of different methods do not agree. The split-half 
estimate, generally obtained by correlating the scores made by the 
individuals on the odd and even items of the test, may be higher or 


by the use of comparable forms of 


split-half and comparable forms or test-retest estimates. The results 
given below refer to the relationship between the split-half and com- 
parable forms estimates, but they apply equally well to the relationship 
between the split-half and test-retest estimates.

The theoretical relationship between these estimates is not difficult 
to determine and is well known. It will be re-developed here, how- 
ever, since it is necessary to state clearly the assumptions made in 
determining the relationship, and to test the validity of these
assumptions.

Denote by
$Z_{1t}$: the score of the $t$-th individual on the first form of the
test;
$Z_{2t}$: the score of the $t$-th individual on the second form of the
test;
$X_{1t}$, $Y_{1t}$: the scores made by the $t$-th individual on the odd and
even items, respectively, of the first test;
$X_{2t}$, $Y_{2t}$: the scores made by the $t$-th individual on the odd and
even items, respectively, of the second test;
$\Sigma$: summation;
$S$: the standard deviation, e.g.

$$S_{Z_1} = \sqrt{\frac{\Sigma Z_{1t}^2}{N} - \left(\frac{\Sigma Z_{1t}}{N}\right)^2};$$

$r$: the Pearson product-moment correlation coefficient;
$N$: the number of individuals tested.

It may easily be shown that


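$$S_{Z_1}^2 = S_{X_1}^2 + S_{Y_1}^2 + 2 r_{X_1Y_1} S_{X_1} S_{Y_1} \tag{10}$$

$$S_{Z_2}^2 = S_{X_2}^2 + S_{Y_2}^2 + 2 r_{X_2Y_2} S_{X_2} S_{Y_2} \tag{11}$$

$$r_{Z_1Z_2} S_{Z_1} S_{Z_2} = r_{X_1X_2} S_{X_1} S_{X_2} + r_{X_1Y_2} S_{X_1} S_{Y_2} + r_{Y_1X_2} S_{Y_1} S_{X_2} + r_{Y_1Y_2} S_{Y_1} S_{Y_2} \tag{12}$$

$$r_{Z_1Z_2} = \frac{r_{X_1X_2} S_{X_1} S_{X_2} + r_{X_1Y_2} S_{X_1} S_{Y_2} + r_{Y_1X_2} S_{Y_1} S_{X_2} + r_{Y_1Y_2} S_{Y_1} S_{Y_2}}{\sqrt{\left[S_{X_1}^2 + S_{Y_1}^2 + 2 r_{X_1Y_1} S_{X_1} S_{Y_1}\right]\left[S_{X_2}^2 + S_{Y_2}^2 + 2 r_{X_2Y_2} S_{X_2} S_{Y_2}\right]}} \tag{13}$$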


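Obviously $r_{Z_1Z_2}$ is determined exactly by the values of the six
intercorrelations $r_{X_1X_2}$, $r_{X_1Y_2}$, $r_{Y_1X_2}$, $r_{Y_1Y_2}$,
$r_{X_1Y_1}$, $r_{X_2Y_2}$ and the values of the four standard deviations
$S_{X_1}$, $S_{Y_1}$, $S_{X_2}$, $S_{Y_2}$. If we assume that

$$\text{(1)}\quad r_{X_1X_2} = r_{X_1Y_2} = r_{Y_1X_2} = r_{Y_1Y_2} = r_{X_1Y_1} = r_{X_2Y_2} = r \quad\text{and}\quad \text{(2)}\quad S_{X_1} = S_{Y_1} = S_{X_2} = S_{Y_2} = S \tag{14}$$

then equation (13) reduces to

$$r_{Z_1Z_2} = \frac{2r}{1 + r} \tag{15}$$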


Equation (15) is the Spearman-Brown formula used in determining 
the reliability of the whole test from that of the half-test, i.e. 


$$r_{22} = \frac{2 r_{11}}{1 + r_{11}} \tag{16}$$

where $r_{22}$ denotes the reliability coefficient for the whole test and
$r_{11}$ denotes the reliability coefficient for the half-test.
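The step from equation (13) to equation (15) is short once the two assumptions of (14) are substituted. A sketch of the algebra, writing $r$ for the assumed common correlation and $S$ for the assumed common standard deviation:

```latex
% Numerator of (13): four cross terms, each equal to r S^2.
r_{X_1X_2}S_{X_1}S_{X_2} + r_{X_1Y_2}S_{X_1}S_{Y_2}
  + r_{Y_1X_2}S_{Y_1}S_{X_2} + r_{Y_1Y_2}S_{Y_1}S_{Y_2} = 4rS^2

% Each bracket under the square root of (13):
S^2 + S^2 + 2rS^2 = 2S^2(1 + r)

% Hence
r_{Z_1Z_2} = \frac{4rS^2}{\sqrt{\left[2S^2(1+r)\right]^2}}
           = \frac{4rS^2}{2S^2(1+r)}
           = \frac{2r}{1+r}
```

The standard deviations cancel, which is why only the equality of the correlations matters for the size of the final coefficient once the standard deviations are equal.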

In using the split-half method of estimating the reliability of the
test, we use $r_{11} = r_{X_1Y_1}$ (or $r_{11} = r_{X_2Y_2}$) and substitute
this value in equation (16) to obtain the estimate of the reliability of
the test. In so doing, we make the assumptions shown above in equation
(14) and also the implicit assumption that $r_{X_1Y_1}$ or $r_{X_2Y_2}$, as
the case may be, is an unbiased estimate of the common $r$ of equation
(14). Experience has shown that the assumption of equal standard
deviations, i.e. $S_{X_1} = S_{Y_1} = S_{X_2} = S_{Y_2} = S$, is generally
satisfied in practice, but, of course, the validity of this assumption
should be tested in each case. On the other hand, experience has also
shown that the six intercorrelations are seldom, if ever, equal and that
in particular $r_{X_1Y_1}$ or $r_{X_2Y_2}$ may be biased estimates of the
assumed common value. The following example relates to the results for
a small group of 56 pupils on two forms of an intelligence test, and
shows clearly the kind of results which may be obtained. We found:


$r_{X_1X_2} = 0.585$        $S_{X_1} = 4.523$
$r_{Y_1X_2} = 0.652$        $S_{Y_1} = 4.247$
$r_{X_1Y_2} = 0.707$        $S_{X_2} = 4.677$
$r_{Y_1Y_2} = 0.743$        $S_{Y_2} = 5.089$
$r_{X_1Y_1} = 0.766$        $r_{Z_1Z_2} = 0.765$
$r_{X_2Y_2} = 0.772$

The standard deviations are all of the same order of magnitude, hence 
the assumption of equal standard deviations may be accepted as valid. 


If we assume only that the standard deviations are equal, equation
(13) reduces to

$$r_{Z_1Z_2} = \frac{r_{X_1X_2} + r_{X_1Y_2} + r_{Y_1X_2} + r_{Y_1Y_2}}{2\sqrt{(1 + r_{X_1Y_1})(1 + r_{X_2Y_2})}} \tag{17}$$
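As a numerical check, equation (17) and the Spearman-Brown formula (16) can both be evaluated for the six correlation coefficients listed above. This is a minimal sketch; the function and variable names are illustrative only.

```python
import math

# Coefficients quoted in the text for the group of 56 pupils.
r_x1x2, r_y1x2, r_x1y2, r_y1y2 = 0.585, 0.652, 0.707, 0.743
r_x1y1, r_x2y2 = 0.766, 0.772
r_observed = 0.765


def spearman_brown(r_half):
    """Equation (16): whole-test reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def whole_test_r_equal_sd(cross_correlations, within_first, within_second):
    """Equation (17): whole-test correlation assuming equal standard deviations."""
    numerator = sum(cross_correlations)
    denominator = 2 * math.sqrt((1 + within_first) * (1 + within_second))
    return numerator / denominator


r17 = whole_test_r_equal_sd([r_x1x2, r_x1y2, r_y1x2, r_y1y2], r_x1y1, r_x2y2)
sb_first = spearman_brown(r_x1y1)
sb_second = spearman_brown(r_x2y2)

print(f"equation (17):       {r17:.3f}  (observed {r_observed})")
print(f"Spearman-Brown, 1st: {sb_first:.3f}")
print(f"Spearman-Brown, 2nd: {sb_second:.3f}")
```

Equation (17) stays within a few thousandths of the observed coefficient, while both Spearman-Brown values exceed it by about 0.10, so the discrepancy comes from the half-test correlations and not from the standard deviations.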
Substituting the values of the six correlation coefficients in (17), we
find $r_{Z_1Z_2} = 0.760$, which is an additional demonstration that the
difficulty does not lie in the assumption regarding the equality of the
standard deviations. Using the values of $r_{X_1Y_1}$ and $r_{X_2Y_2}$ in
equation (16), however, we find:

for $r_{11} = r_{X_1Y_1} = 0.766$, $r_{Z_1Z_2} = 0.867$

for $r_{11} = r_{X_2Y_2} = 0.772$, $r_{Z_1Z_2} = 0.871$

which differ considerably from the observed value of 0.765.

The error, in this case at least, lies in using $r_{X_1Y_1}$ or
$r_{X_2Y_2}$ as an estimate of the common coefficient of correlation of
equation (14). Let us examine more closely the procedure underlying the
determination of $r_{X_1Y_1}$ (the position is the same, of course, for
$r_{X_2Y_2}$). The values of $X_{1t}$ and $Y_{1t}$ are determined, as
explained above, from the scores on the odd and even items of the test.
The formula generally used¹ in calculating $r_{X_1Y_1}$ is

$$r_{X_1Y_1} = \frac{\Sigma(X_{1t}Y_{1t}) - \dfrac{(\Sigma X_{1t})(\Sigma Y_{1t})}{N}}{N S_{X_1} S_{Y_1}} \tag{18}$$
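We can see more clearly where the difficulty lies, however, if we write
equation (18) in the following form:

$$r_{X_1Y_1} = \frac{S_{(X_1+Y_1)}^2 - S_{(X_1-Y_1)}^2}{4 S_{X_1} S_{Y_1}} \tag{19}$$

where

$$S_{(X_1+Y_1)}^2 = \frac{\Sigma(X_{1t} + Y_{1t})^2}{N} - \left[\frac{\Sigma(X_{1t} + Y_{1t})}{N}\right]^2 \tag{20}$$

$$S_{(X_1-Y_1)}^2 = \frac{\Sigma(X_{1t} - Y_{1t})^2}{N} - \left[\frac{\Sigma(X_{1t} - Y_{1t})}{N}\right]^2 \tag{21}$$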
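That the sum-and-difference form gives the same coefficient as the usual product-moment formula can be confirmed numerically. The sketch below uses a small set of hypothetical odd and even half-scores (invented for illustration, not taken from the bulletin) and variances with divisor N.

```python
import math


def pvar(values):
    """Variance with divisor N."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def r_product_moment(x, y):
    """Correlation from the covariance and the two standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return cov / math.sqrt(pvar(x) * pvar(y))


def r_sum_difference(x, y):
    """Correlation from the variances of the sums and of the differences."""
    sums = [a + b for a, b in zip(x, y)]
    diffs = [a - b for a, b in zip(x, y)]
    return (pvar(sums) - pvar(diffs)) / (4 * math.sqrt(pvar(x) * pvar(y)))


# HYPOTHETICAL half-test scores for eight pupils.
odd_scores = [12, 15, 9, 20, 17, 11, 14, 18]
even_scores = [10, 16, 8, 19, 15, 13, 12, 17]

r_a = r_product_moment(odd_scores, even_scores)
r_b = r_sum_difference(odd_scores, even_scores)
print(r_a, r_b)
```

Writing the coefficient this way makes it plain that anything which shrinks the variance of the differences, while the variance of the sums is held fixed or increased, must raise the correlation.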


¹See Appendix A for a discussion of the procedure to be followed in
estimating reliability coefficients in such cases.

The differences $X_{1t} - Y_{1t}$ are not independent of the total scores;
in fact for large and for small total scores these differences must of
necessity be small. This means, therefore, that when the proportion
of small or large total scores in the sample is increased
$S_{(X_1-Y_1)}^2$ must become smaller. At the same time, of course, we may
increase the other term, $S_{(X_1+Y_1)}^2$, in the numerator of equation
(19) so the net result will be a spurious increase in the value of
$r_{X_1Y_1}$. This element of "spuriousness" will generally be present in
the correlation of the scores on the odd and even items of a test and
seems to be an inherent weakness of the method. The magnitude of the
spurious element will always depend on the distribution of total scores
on the test and in some cases it may not be important.

We tried to develop a correction term to allow for the bias involved,
but we were not successful. The nature of the relationship between
$S_{(X_1-Y_1)}^2$ and the total scores on a test is not difficult to
determine. The results for three different types of distributions of
total scores are shown in Tables XI, XII and XIII. In each of the three
cases, of course, we kept the first term, $S_{(X_1+Y_1)}^2$, in the
numerator of equation (19) constant.

In the first case, we chose a rectangular distribution of total scores,
actually 5 papers for each total score from 6 to 60 inclusive (the total
number of items on the test used was 75). Each group of five
consecutive scores (i.e. 25 papers) was then treated as a unit, the
scores made by each individual on the odd and even items found and the
value of $S_{(X_1-Y_1)}^2$ calculated for each such unit. The results are
shown in Table XI.

TABLE XI

VALUES OF $S_{(X_1-Y_1)}^2$ FOR VARIOUS TOTAL SCORE GROUPS:
RECTANGULAR DISTRIBUTIONS, 25 PAPERS IN EACH GROUP

Total Score Groups      Values of $S_{(X_1-Y_1)}^2$

      6-10                      7.6
     11-15                     10.2
     16-20                     13.4
     21-25                     14.8
     26-30                     16.6
     31-35                     14.8
     36-40                     12.5
     41-45                     13.4
     46-50                     10.5
     51-55                      7.8
     56-60                      4.6

These results show a definite relationship between $S_{(X_1-Y_1)}^2$ and
the total score; for larger and smaller scores than those considered
$S_{(X_1-Y_1)}^2$ will continue to decrease, reaching the minimum value of
zero for total scores of 0 and 75.
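The reason the variance of the odd-even differences must vanish at both extremes can be seen by listing the differences that are arithmetically possible for a given total score. The sketch below assumes a 75-item test scored one point per item, so that there are 38 odd-numbered and 37 even-numbered items; the function name is illustrative.

```python
def difference_bounds(total, n_items=75):
    """Smallest and largest possible (odd score - even score) for a total score.

    ASSUMPTION: one point per item, items numbered 1..n_items, so the odd
    half has (n_items + 1) // 2 items and the even half n_items // 2.
    """
    n_odd = (n_items + 1) // 2
    n_even = n_items // 2
    min_odd = max(0, total - n_even)   # even half answered as fully as possible
    max_odd = min(n_odd, total)        # odd half answered as fully as possible
    # odd - even = odd - (total - odd) = 2 * odd - total
    return 2 * min_odd - total, 2 * max_odd - total


for total in (0, 10, 37, 60, 75):
    low, high = difference_bounds(total)
    print(f"total {total:2d}: difference between {low:3d} and {high:3d}, "
          f"width {high - low}")
```

The width of the admissible interval is zero for totals of 0 and 75 and greatest near the middle of the range, which is the pattern the observed values in Table XI follow.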

In practice we do not find rectangular distributions of scores like
those in the example considered above. For this reason,
we chose normal distributions of total scores in each unit for the
following two examples. The results are shown in Tables XII and
XIII; in the first case the distributions did not overlap, but in the
second they did. We used 100 papers in each group of total scores
for these cases, not 25 as in the previous case.


TABLE XII

VALUES OF $S_{(X_1-Y_1)}^2$ FOR VARIOUS TOTAL SCORE GROUPS:
NORMAL DISTRIBUTIONS, NO OVERLAP, 100 PAPERS IN EACH GROUP

Total Score Groups      Values of $S_{(X_1-Y_1)}^2$

      6-20                     10.1
     21-35                     14.3
     36-50                     12.4
     49-63                      8.4


TABLE XIII

VALUES OF $S_{(X_1-Y_1)}^2$ FOR VARIOUS TOTAL SCORE GROUPS:
NORMAL DISTRIBUTIONS, OVERLAPPING, 100 PAPERS IN EACH GROUP

Total Score Groups      Values of $S_{(X_1-Y_1)}^2$

      1-35                     12.4
     18-52                     14.4
     41-75                      8.1


The results here are similar to those shown in Table XI, but the
differences are not as great. It is clear, however, that this effect
will influence our estimates of the reliability coefficients, and in
particular our estimates of the standard error of measurement. It
follows, therefore, that we cannot use the analysis of variance method,
which assumes that $S_{(X_1-Y_1)}^2$ is independent of the scores, in
analysing such data. No matter what method we use in analysing data
of this kind, great care must be taken in the interpretation of the
results. It is doubtful if the split-half method can be used with any
degree of confidence in estimating reliability coefficients. The
estimates so obtained may be used as measures of "internal consistency"
but not in all cases as measures of "reliability".

The results of two other experiments, which will now be considered, 
indicate that there are other factors, especially the test content, influ- 
encing the split-half estimates. It seems that no generalization can 
be made for all tests; each test must be considered separately. 


Note: 

Formula (19) may be used in the calculation of any correlation 
coefficient, and applies, with a slight change of notation, to the esti- 
mation of the reliability coefficient by the test-retest and comparable 
forms methods. It shows clearly the effect of selection of the group 
tested on the estimate of the reliability coefficient. Since the second
term in the numerator is practically constant for a particular test,
we may even find negative values of $r_{X_1Y_1}$ if the differences
between the individuals tested are very small.


Experiment II. Comparison of Comparable Forms, Test-Retest and
Split-half Estimates of Reliability. 

Although many comparisons of the different estimates of reliability
have been reported in the literature,² the experiments were not
designed for the specific purpose we had in mind. We wished to compare
the different estimates of reliability and at the same time to vary
the length of time elapsing between the tests in order to determine
what, if any, influence this had on the results. The Advanced Dominion
Group Test of Intelligence³ was chosen for this experiment;
two comparable forms, A and B, were available, each consisting of 
15 items of varying degrees of difficulty. For the test-retest experi- 
ments, Form B only was used. The tests were given to pupils in 
grades 9, 10, 11 and 12 but in the results given below no allowance has 
been made for between-grades differences as we were not interested 
in this particular factor. Each pupil was given either the two forms 
of the test or the same form twice, and the length of time elapsing 
between tests varied from a few minutes to 24 hours. The plan of 
the experiment is shown in Table XIV. It was impossible to obtain
equal numbers of pupils for each group, but this does not matter much.

²See Chapter I.
³Published by the Department of Educational Research, University of Toronto.


TABLE XIV

PLAN OF EXPERIMENT

                                               Number of Cases in
Time tests were given                      Test-retest    Comparable forms
                                           group          group

In consecutive periods                        146              86
In morning and afternoon of same day          214             100
In corresponding periods of consecutive
  days                                        189             249


For each group we have, using the usual correlation technique,
two split-half estimates and a test-retest or comparable forms estimate
of the reliability of the test. These estimates are given in Table XV.
TABLE XV

COMPARISON OF TEST-RETEST, COMPARABLE FORMS AND
SPLIT-HALF ESTIMATES OF RELIABILITY

                                     Reliability Coefficients

Time tests           Test-     Split-half  Split-half  Comparable  Split-half  Split-half
were given           retest    Form B      Form B      Forms       First       Second
                     Form B    (First)     (Second)    A & B       Form (A)    Form (B)

Consecutive
periods              0.937     0.880       0.912       0.839       0.833       0.889
Morning and
afternoon of
same day             0.914     0.882       0.909       0.881       0.921       0.910
Consecutive
days                 0.932     0.889       0.923       0.887       0.901       0.923


The test-retest estimates are consistently higher than the comparable
forms estimates, and the split-half estimates are generally higher than
the comparable forms and lower than the test-retest estimates. An
interesting point is the increase in the split-half estimates from the
first to the second form for both test-retest and comparable forms
groups. A similar change has been noted in other cases, but very
often we find a decrease instead of an increase. This effect seems to
be caused by practice lengthening or shortening the effectual length⁴
of the test. If the test is rather easy in the first place, we generally
find a decrease; if the test is a little too difficult, we tend to find an
increase.

If we use the analysis of variance method in analysing these data, 
we find a simple explanation of the difference between the test-retest 
and comparable forms estimates. The complete results of this 
analysis are presented in Table XVI. The values are arranged to
assist in the comparison of results for each group on the two
experimental methods, and between groups for each experimental method.

The error variance seems to be independent of the length of time
elapsing between the tests. In each case, however, the test-retest and
comparable forms estimates of error are significantly different; the
comparable forms experimental method yields consistently higher
estimates. The tests of significance of the differences are made as
shown below [see 53].

1. Consecutive periods

Calculate $F = \dfrac{15.5}{9.1} = 1.7$ and refer to Snedecor's tables of F
with degrees of freedom $n_1 = 85$ and $n_2 = 145$.

2. Morning and afternoon of same day

Calculate $F = \dfrac{14.7}{10.3} = 1.4$ and refer to Snedecor's tables of F
with degrees of freedom $n_1 = 99$ and $n_2 = 213$.

3. Consecutive days

Calculate $F = \dfrac{15.9}{10.6} = 1.5$ and refer to Snedecor's tables of F
with degrees of freedom $n_1 = 248$ and $n_2 = 188$.

In all cases the differences are significant, so we may conclude that
different experimental methods give different estimates of the errors
of measurement.
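The three variance ratios can be recomputed from the error sums of squares and degrees of freedom given in Table XVI. This is a sketch with illustrative names; it forms only the ratios, and the comparison with tabulated critical values of F is left to the tables cited in the text.

```python
# Error lines of Table XVI as (sum of squares, degrees of freedom).
error_lines = {
    "consecutive periods": {"test_retest": (1335, 145), "comparable": (1321, 85)},
    "same day": {"test_retest": (2187, 213), "comparable": (1457, 99)},
    "consecutive days": {"test_retest": (1994, 188), "comparable": (3940, 248)},
}


def mean_square(sum_of_squares, degrees_of_freedom):
    return sum_of_squares / degrees_of_freedom


def variance_ratio(group):
    """F = comparable-forms error mean square / test-retest error mean square."""
    ms_comparable = mean_square(*group["comparable"])
    ms_retest = mean_square(*group["test_retest"])
    return ms_comparable / ms_retest


f_ratios = {name: variance_ratio(group) for name, group in error_lines.items()}

for name, f_value in f_ratios.items():
    n1 = error_lines[name]["comparable"][1]
    n2 = error_lines[name]["test_retest"][1]
    print(f"{name}: F = {f_value:.1f} with n1 = {n1}, n2 = {n2}")
```

The larger mean square is placed in the numerator in every case, so $n_1$ is always the comparable forms degrees of freedom.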


⁴By this is meant the number of items which are used by the individuals, not
necessarily the number of items on the test. It is interesting to note that we have
no exact measure of the "true" or "effectual" length of a test; at least not as far as
the authors are aware. The reader can easily convince himself that the number of
items composing the test is a poor measure of its length by considering a simple case.
For very easy or very difficult tests, clearly the discrimination between individuals
is obtained on relatively few items, the remaining items being useless as far as the
purpose of the test is concerned. In these cases the number of items composing
the test may bear little or no relation to its "true" or "effectual" length.


TABLE XVI

COMPARISON OF RESULTS OF THE ANALYSIS OF VARIANCE OF DATA OBTAINED BY
USING TEST-RETEST AND COMPARABLE FORMS EXPERIMENTAL METHODS

                                         Test-Retest                  Comparable Forms
Time tests    Variance              Degrees of  Sum of   Mean     Degrees of  Sum of   Mean
were given                          Freedom     Squares  Square   Freedom     Squares  Square

Consecutive   Between Trials             1       3,563   3,563         1       1,856   1,856
periods       Between Individuals      145      36,154     249        85      14,368     169
              Error                    145       1,335     9.1        85       1,321    15.5
              Total                    291      41,052               171      17,545

Morning and   Between Trials             1       3,587   3,587         1       1,480   1,480
afternoon of  Between Individuals      213      45,596     214        99      22,591     228
same day      Error                    213       2,187    10.3        99       1,457    14.7
              Total                    427      51,370               199      25,528

Consecutive   Between Trials             1       6,638   6,638         1         953     953
days          Between Individuals      188      52,055     277       248      65,325     263
              Error                    188       1,994    10.6       248       3,940    15.9
              Total                    377      60,687               497      70,218


It is difficult to determine the cause of this effect. It may be due 
to the memory factor entering in the test-retest method but it is more 
likely that, in spite of the care taken in constructing the test, the two 
forms were not exactly comparable.

With regard to the other effects, practice appears to be more important
in the test-retest method, which may account for part of the
difference discussed in the preceding paragraph. We note also that
for some reason the first comparable forms group was more homo- 
geneous than the others. This does not affect the between trials or 
error estimates for this group, of course, but it does affect the estimate 
of the reliability coefficient and hence explains the lower values shown 
for this group in Table XV. 
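The dependence of a reliability coefficient on the spread of the group, with the error estimate unchanged, can be illustrated from the mean squares of Table XVI. The sketch below uses a two-trial intraclass form, (between individuals minus error) divided by (between individuals plus error). That choice is an assumption made here for illustration; the coefficients in Table XV were obtained by the usual correlation technique, so only approximate agreement should be expected.

```python
# Mean squares read from Table XVI: (between individuals, error).
mean_squares = {
    ("consecutive periods", "test-retest"): (249, 9.1),
    ("consecutive periods", "comparable forms"): (169, 15.5),
    ("same day", "test-retest"): (214, 10.3),
    ("same day", "comparable forms"): (228, 14.7),
    ("consecutive days", "test-retest"): (277, 10.6),
    ("consecutive days", "comparable forms"): (263, 15.9),
}


def reliability_from_mean_squares(ms_individuals, ms_error):
    """ASSUMED two-trial intraclass coefficient, not the authors' stated formula."""
    return (ms_individuals - ms_error) / (ms_individuals + ms_error)


estimates = {
    key: reliability_from_mean_squares(ms_ind, ms_err)
    for key, (ms_ind, ms_err) in mean_squares.items()
}

for (period, method), value in estimates.items():
    print(f"{period:20s} {method:17s} {value:.3f}")
```

The three comparable forms groups have nearly equal error mean squares, yet the first has the smallest between-individuals mean square and therefore the lowest coefficient of the three.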

On the basis of all these results, we may conclude that the estimates 
of the reliability of a test obtained by the use of different experimental 
methods are not exactly comparable. In reporting on the reliability 
of a test we should, therefore, state which experimental method was 
used and if possible give results for each method. It is clear that we 
cannot compare the reliability of different tests unless complete and 
detailed information on these points is available. 


Experiment III. Comparison of Test-Retest and Split-half Estimates
of Reliability for a Battery of Sub-tests. 

Tests may be composed of a large number of items, of the same or 
different content, or of a series of short sub-tests of different content. 
In the above two experiments we used tests of the first type but for 
this experiment we chose a test composed of six relatively short sub- 
tests of different content. We wished to determine the relationship 
between the test-retest and split-half estimates of reliability separately 
for each sub-test and for the whole test and, in addition, the effect of 
varying the time between tests on these estimates and their relation- 
ship. These estimates have also been compared with the estimates 
given by the application of Kuder and Richardson's formula (20) [63]
for two of the groups of children considered. Other questions relating 
to the reliability of a battery of tests or sub-tests have been considered 
in another section of the bulletin. 

The test chosen for use in this experiment was the Revised Beta 
Examination prepared by C. E. Kellogg and N. W. Morton of McGill 
University, Montreal, Canada. The test is composed of six short sub- 
tests (also six exercises, one for each sub-test) and the material is non- 
verbal in content. The content of each sub-test and maximum score 
are shown in Table XVII; the total score is obtained by adding the 
unweighted sub-test scores. 

TABLE XVII

CONTENT AND MAXIMUM SCORE OF EACH SUB-TEST
OF REVISED BETA EXAMINATION

Sub-test   Content                                              Maximum
                                                                Score

   1                                                              10
   2                                                              30
   3       Common-sense picture discrimination (picture
           absurdities) (20 items)                                20
   4                                                              18
   5                                                              20
   6                                                              25


The group of children tested consisted of 5 classes of Grade IX
pupils in an Ontario school. Altogether 175 pupils were tested, but
only 156 of these were present for both the test and retest. The interval
elapsing between the tests was one-half day, 1 day, 3 days, 1 week and
5 weeks for the classes designated A, B, E, D and C, respectively. The
classes, testing dates, number of pupils, mean scores and standard
deviations for each sub-test are shown in Table XVIII. The tests seem
to be too easy for Grade IX pupils and, partly for this reason, the
spread of scores is not large (for the whole test, the standard deviations
varied from 6.52 to 9.90). As homogeneous grouping is not used in
this school, there is very little difference between the classes.

There is, unfortunately, no general pattern evident in the results 
except for the practice effect, but even this varies considerably. The 
standard deviations on the second trial are both larger and smaller 
than those on the first trial. Except for those cases in which the
average score on the second trial was very high compared with the
number of items on the test, e.g. in the case of sub-test 6 in Class C,
it is impossible to determine just what is the net effect of practice. In
some cases it seems to increase, and in others decrease, the "effectual"
length of the test. As far as determining a measure of the true length
of the test is concerned, these results indicate that some function of
the standard deviation should be considered.

TABLE XVIII

REVISED BETA EXAMINATION: MEAN SCORE AND STANDARD
DEVIATION OF SCORES ON EACH SUB-TEST (BY CLASSES)

Class  Testing          Number of  Sub-test     Mean Score        Standard Deviation
       Dates            Pupils                1st       2nd        1st       2nd
                                              Trial     Trial      Trial     Trial

A      Dec. 15th A.M.      32         1        7.7       8.0       1.41      1.09
       Dec. 15th P.M.                 2       23.3      25.4       3.18      3.77
                                      3       13.3      14.2       2.99      3.11
                                      4       10.0      11.5       2.88      3.12
                                      5       14.8      16.3       2.45      2.11
                                      6       19.8      20.4       2.35      2.38

B      Dec. 14th           34         1        7.6       8.5       1.19      1.22
       Dec. 15th                      2       22.3      25.3       3.29      2.93
                                      3       12.8      14.4       2.26      2.46
                                      4        9.8      11.5       3.04      3.17
                                      5       14.8      16.5       2.74      2.43
                                      6       18.9      20.5       2.29      2.29

C      Dec. 15th           29         1        7.6       8.6       1.16      1.25
       Jan. 22nd                      2       21.1      24.4       3.08      3.78
                                      3       12.2      14.1       2.16      2.35
                                      4       10.6      12.2       3.23      3.44
                                      5       14.4      16.0       2.23      2.25
                                      6       18.2      23.0       2.55      1.64

D      Dec. 14th           32         1       441.       8.6       1.34      0.96
       Dec. 21st                      2       21.3      25.4       3.33      3.20
                                      3       12.8      14.9       2.86      2.49
                                      4       11.0      12.6       2.96      2.55
                                      5       15.1      17.1       1.95      1.94
                                      6       20.5      21.0       1.89      2.16

E      Dec. 15th           29         1        7.7       9.1       1.44      1.06
       Dec. 18th                      2       23.7      26.6       4.85      3.03
                                      3       12.5      13.3       2.19      1.75
                                      4       10.7      12.6       3.13      3.24
                                      5       14.4      16.7       2.28      1.84
                                      6       20.6      22.2       2.53      2.15


The position is much clearer when we consider the comparison of
the test-retest and split-half estimates of reliability. The results are
fairly consistent for each sub-test but they vary from one sub-test to
another. For this reason, therefore, the data given in Table XIX are
arranged by sub-tests, not by classes as in Table XVIII.


TABLE XIX

REVISED BETA EXAMINATION: COMPARISON OF TEST-RETEST AND
SPLIT-HALF ESTIMATES OF RELIABILITY (BY SUB-TESTS)

                              Reliability Coefficients

Sub-test     Class     Test-Retest         Split-half*
                                       1st Trial   2nd Trial

   1           A          0.469         0.692       0.238
               B          0.363         0.407       0.407
               C          0.301         0.409       0.442
               D          0.498         0.714       0.410
               E          0.562         0.591       0.713

   2           A          0.714         0.949       0.986
               B          0.467         0.967       0.947
               C          0.591         0.984       0.978
               D          0.842         0.974       0.983
               E          0.686         0.979       0.966

   3           A          0.897         0.708       0.724
               B          0.713         0.591       0.825
               C          0.566         0.398       0.507
               D          0.676         0.590       0.567
               E          0.560         0.378       0.091

   4           A          0.878         0.712       0.833
               B          0.789         0.758       0.813
               C          0.811         0.895       0.811
               D          0.826         0.796       0.560
               E          0.769         0.825       0.812

   5           A          0.801         0.752       0.673
               B          0.868         0.828       0.680
               C          0.661         0.566       0.642
               D          0.727         0.502       0.477
               E          0.851         0.501       0.665

   6           A          0.660         0.782       0.781
               B          0.539         0.779       0.790
               C          0.091         0.793       0.306
               D          0.848         0.533       0.817
               E          0.674         0.731       0.833

 Whole         A          0.920         0.916       0.915
 test          B          0.773         0.936       0.908
               C          0.623         0.876       0.848
               D          0.856         0.851       0.810
               E          0.841         0.855       0.847

*For whole test.


Sub-tests 1, 2 and 6 form a group giving similar results. In these
cases the split-half estimates, except for two cases for sub-tests 1 and 6,
are consistently higher than the test-retest estimates. The explanation
of this difference seems to lie in the content of the sub-tests. In
sub-test 2, for example, the items are all of the same difficulty and are
of such a nature that if a pupil gets one item correct, he is almost
certain to obtain the correct answer for the adjoining item. A somewhat
similar situation exists for sub-tests 1 and 6 but here the effect
is not so marked. Clearly, for tests of such content the split-half
method should not be used in estimating the reliability of the test.

Sub-tests 3 and 5 form another group but in this case the split-half 
estimates are consistently lower than the test-retest estimates (except 
for one value for sub-test 3). These sub-tests are similar in content; 
in 3 the pupil is asked to mark the absurd picture while in 5 he is asked
to complete the drawing. Since the items in these sub-tests are not 
arranged properly in order of difficulty, the two halves of the test 
formed by the odd and even items are not exactly comparable. This 
effect may lower the split-half estimate. On the other hand, the items 
are of such a nature that the pupil is likely to remember the answer 
given on the first test. This effect would tend to increase the test- 
retest estimates. It is probable that some combination of these two 
effects accounts for the observed differences in the results. 

In the case of sub-test 4, the test-retest and split-half estimates 
agree very well; the split-half estimates are higher in half the cases 
and lower in the other half. The items of this sub-test seem to be 
satisfactorily arranged in order of difficulty, except possibly item 7, 
and it is unlikely that the pupils could remember very well the answers 
given on the first testing. 

A comparison, for two of the classes, of the test-retest, split-half 
and Kuder-Richardson estimates of reliability is shown in Table XX. 
The Kuder-Richardson estimates, which are really estimates of the 
internal consistency of the sub-tests, agree better with the split-half 
than with the test-retest estimates. This indicates that the split-half 
method gives estimates of the internal consistency of the test, which 
may be very different from the reliability of the test as discussed in 
the earlier sections of this bulletin. Clearly, the different methods 
give results which are not always comparable, and in some cases it is 
difficult to determine exactly what is measured. 
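For readers who wish to see what a Kuder-Richardson estimate involves, the following sketch computes formula (20) in its commonly quoted form, n/(n-1) multiplied by (1 minus the sum of pq over the variance of total scores), from a matrix of right-wrong item scores. The data are hypothetical, the function name is illustrative, and the variance is taken with divisor N; none of this reproduces the bulletin's own working.

```python
def kr20(item_scores):
    """Kuder-Richardson formula (20) for 0/1 item scores.

    item_scores: one list per person, each holding that person's item scores.
    """
    n_people = len(item_scores)
    n_items = len(item_scores[0])

    totals = [sum(person) for person in item_scores]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    sum_pq = 0.0
    for j in range(n_items):
        p = sum(person[j] for person in item_scores) / n_people  # proportion passing
        sum_pq += p * (1 - p)

    return n_items / (n_items - 1) * (1 - sum_pq / var_total)


# HYPOTHETICAL answers of five pupils to four items of increasing difficulty.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(round(kr20(answers), 3))
```

Only a single administration enters the calculation, which is why the result is best read as a measure of internal consistency.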

A comparison, for the test as a whole, of the test-retest and split- 
half estimates of reliability is given in the last section of Table XIX. 
The estimates agree fairly well for three of the classes, A, D and E, 
but differ considerably for the remaining two. It will be realized, of 
course, that strange results may be obtained when we combine the 
unweighted scores for sub-tests of such different content. 

TABLE XX

REVISED BETA EXAMINATION: COMPARISON OF TEST-RETEST,
SPLIT-HALF, AND KUDER-RICHARDSON ESTIMATES OF
RELIABILITY (FOR 2 CLASSES)

                                 Reliability Coefficients

Class  Sub-test  Test-Retest       Split-half          Kuder-Richardson
                                1st Trial  2nd Trial  1st Trial  2nd Trial

  B       1        0.363         0.407      0.467      0.260      0.394
          2        0.467         0.967      0.947      0.943      0.912
          3        0.713         0.591      0.825      0.437      0.575
          4        0.789         0.758      0.813      0.743      0.755
          5        0.868         0.828      0.680      0.700      0.666
          6        0.539         0.779      0.790      0.640      0.607

  E       1        0.562         0.591      0.713      0.568      0.499
          2        0.686         0.979      0.966      0.968      0.928
          3        0.560         0.378      0.091      0.374      0.011
          4        0.769         0.825      0.812      0.766      0.781
          5        0.851         0.501      0.655      0.566      0.565
          6        0.674         0.731      0.833      0.686      0.603


The Revised Beta Examination seems to behave rather erratically.
Considering only the test-retest coefficients, for example, the estimates
of reliability vary from 0.623 to 0.920. These differences might, of
course, be due to differences in the variability of the individuals in
different classes with respect to the ability measured by the test. The
analysis of variance of these data given in Table XXI, however, shows
that while there may be certain differences between the classes with
regard to this factor, more significant differences occur in the estimates
of the errors of measurement. The mean square for error varies from
7.79 in Class D to 28.98 in Class C, and these changes are not related
to the length of time elapsing between the tests (the practice, or
between trials, effect is affected by this factor). The samples are small,
of course, but even taking this into account, it is surprising to find
significant differences in estimates of the errors of measurement.


The conclusions which follow from these experimental results are 
summarized below: 


(1) Different experimental methods of measuring reliability
    yield results which are not always comparable.

(2) The Kuder-Richardson and split-half methods give measures
    of the internal consistency of the test; this may or may not be
    the same thing as the reliability of the test.

(3) For certain experimental methods, the content of the test
    affects the estimates of reliability.

(4) Practice may lengthen or shorten the "true" length of the test,
    and hence affect the reliability coefficients.

(5) The length of time elapsing between tests seems to have little
    effect on the estimates of reliability (except for the practice
    effect noted above).

(6) The number of items composing a test is not a very efficient
    measure of the "true" or "effectual" length of a test. We need
    a better definition and measure of length than has as yet been
    proposed.

(7) The estimates of errors of measurement are generally, but not
    always, constant for a particular test. They also are affected
    by the particular experimental method employed.


CHAPTER V


THE ESTIMATION OF TEST RELIABILITY BY THE METHOD 
OF RATIONAL EQUIVALENCE 


Among recent developments in the theory and estimation of test
reliability is a method developed by G. F. Kuder and M. W. Richardson
[63, 88] whereby the reliability of tests may be estimated, on the basis 
of certain assumptions, from item analysis parameters. This method, 
described by its authors as the method of rational equivalence, is 
developed on a foundation of fundamental test structure theory, which 
renders explicit the essential determiners of test functioning. The 
basic observation underlying the method is that both test variance 
and test reliability, relative to a defined population, are functions not 
only of the individual item variances and item reliabilities but also of 
the inter-item covariances. The inter-item covariances are in turn in 
part a function of the item difficulty values and the item content, two 
of the basic determiners of a test’s internal consistency. 

The term rational equivalence results from an operational definition
of equivalence deduced from formulae for the correlation of sums.
A test $Z_1$ is presumed to be equivalent to a hypothetical parallel form
$Z_I$ when every item $i$ of $Z_1$ is interchangeable with a corresponding
item $I$ of $Z_I$, every pair of items being similar with respect to
difficulty and content. The further assumption is that all corresponding
correlations are equal. Thus rational equivalence is defined such that
$S_i = S_I$ and $r_{ij} = r_{IJ} = r_{iJ}$. Relevant to the above is the
observation that precisely similar assumptions underlie the Spearman-Brown
formula for estimating increased reliability with increased length of test.
Somewhat broader assumptions, however, underlie the Spearman-Brown
formula when it is used to estimate reliability by augmenting the
correlation between split-halves.

Now in the derivation of their formulae Kuder and Richardson 
made certain assumptions, over and above the equivalence condition, 
which if necessary for final proof would detract seriously from the 
practical usefulness of their method, since many of the conditions 
which they found it necessary to specify are rarely if ever attained in 
practice, or for that matter even roughly approximated. Further- 
more a few of their assumptions are inconsistent one with the other. 
Since several of these formulae are rapidly coming into use for the
estimation of the reliability of tests, it is of some moment to examine
them carefully, and to re-derive several of them from fewer specified
conditions. Indeed it may be shown that the
Kuder and Richardson formula (20), which was found empirically to 
yield very satisfactory results, may be derived on the basis of the 
equivalence assumption only. In general we may state that these 
authors fell into the common error of specifying conditions that were 
sufficient but unnecessary. 

We may write the intercorrelations between the $n_1$ test items of $Z_1$
and the $n_I$ assumed equivalent items of $Z_I$ in the form of a pooling
square [112], weighting each item according to its standard deviation,
as follows:⁵
⁵ The standard deviation of a dichotomously scored test item, $i$, is
given by the formula

$$S_i = \sqrt{p_i q_i}$$

where $S_i$ = the standard deviation of item $i$,
$p_i$ = the proportion of persons passing item $i$,
$q_i$ = the proportion of persons failing item $i$.

The correlation between any two dichotomously scored test items is
understood in the present paper to be given by

$$r_{ij} = \frac{p_{ij} - p_i p_j}{\sqrt{p_i q_i \, p_j q_j}}$$

where $p_{ij}$ = the proportion of persons passing both items $i$ and $j$.
The equivalence assumption implies that in the above correlation
matrix $r_{ij} = r_{IJ} = r_{iJ}$. The sum of the weighted elements in the
upper left-hand quadrant of the above matrix is the variance of $Z_1$, the
sum of the weighted elements in the lower right-hand quadrant is the
variance of $Z_I$, and the sum of the weighted elements in either the
upper right-hand quadrant or the lower left-hand quadrant is the
covariance. Now, since the correlation between two tests is given
by dividing the covariance by the square root of the product of the
two variances, we may write

$$r_{tt} = \frac{\sum_i r_{iI} S_i S_I + 2\sum_{i<j} r_{iJ} S_i S_J}{\sqrt{\left[\sum_i S_i^2 + 2\sum_{i<j} r_{ij} S_i S_j\right]\left[\sum_I S_I^2 + 2\sum_{I<J} r_{IJ} S_I S_J\right]}}$$

where $r_{tt}$ = reliability of test,
$r_{ij}$ = correlation between items $i$ and $j$,
$S_i$ = standard deviation of item $i$,
$S_j$ = standard deviation of item $j$,
$r_{iI}$ = reliability of item $i$.


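But on the basis of the equivalence assumption

$$r_{ij} = r_{IJ} = r_{iJ}$$

and

$$S_i = S_I, \qquad S_j = S_J$$

consequently

$$2\sum_{i<j} r_{ij} S_i S_j = 2\sum_{I<J} r_{IJ} S_I S_J = 2\sum_{i<j} r_{iJ} S_i S_J$$

and

$$\sum S_i^2 = \sum S_I^2$$

therefore

$$r_{tt} = \frac{S_t^2 - \sum S_i^2 + \sum r_{iI} S_i^2}{S_t^2} \tag{22}$$

where $S_t^2$ is the variance of the test.
Formula (22) is the basic formula underlying Kuder and Richardson's
discussion of test reliability.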
Another interesting derivation of formula (22) is given here. The 
error variance of a single test item may be found by the formula

$$S_{e_i}^2 = S_i^2 - r_{iI} S_i^2 \tag{23}$$

where $S_{e_i}^2$ is the error variance of item $i$. On the assumption that
errors of measurement are uncorrelated, the error variances of the $n$
items on a test may be summed to give the error variance of the whole
test. Thus

$$S_e^2 = \sum_{i=1}^{n} S_{e_i}^2 \tag{24}$$

where $S_e^2$ is the error variance of the whole test.
But the usual formula for the error variance of a test score is

$$S_e^2 = S_t^2 - r_{tt} S_t^2$$

hence

$$r_{tt} = 1 - \frac{S_e^2}{S_t^2} \tag{25}$$

Substituting (23) and (24) in (25) we obtain

$$r_{tt} = \frac{S_t^2 - \sum S_i^2 + \sum r_{iI} S_i^2}{S_t^2}$$

which is identical with formula (22).
We could now by making certain assumptions estimate the term
$\sum r_{iI} S_i^2$, and derive the Kuder-Richardson formula (20). This
formula, however, is capable of derivation by more direct means.


The variance of a test of $n$ items, written as a function of the item
variances and inter-item covariances, is as follows:

$$S_t^2 = \sum_i S_i^2 + 2\sum_{i<j} r_{ij} S_i S_j = n\,\overline{S_i^2} + n(n-1)\,\overline{r_{ij} S_i S_j} \tag{26}$$

where $\overline{S_i^2}$ = average item variance,
$\overline{r_{ij} S_i S_j}$ = average item covariance.


If we assume that there exists a hypothetical parallel form of the
test, also composed of $n$ items, then the variance of the sum of these
two tests will be

$$S_{2t}^2 = 2n\,\overline{S_i^2} + 2n(2n-1)\,\overline{r_{ij} S_i S_j} \tag{27}$$

where $S_{2t}^2$ is the variance of the sum of scores on the two equivalent
forms. We know, however, from formulae for the correlation of sums
that

$$S_{2t}^2 = 2 S_t^2 (1 + r_{tt}) \tag{28}$$

where $r_{tt}$ is the reliability coefficient; that is, the correlation between
a test and its hypothetical equivalent form.

Substituting the values of $S_t^2$ and $S_{2t}^2$ from (26) and (27)
respectively in (28), and solving for $r_{tt}$, we obtain

$$r_{tt} = \frac{n}{n-1} \cdot \frac{S_t^2 - \sum S_i^2}{S_t^2} \tag{29}$$

This formula is identical with the Kuder-Richardson formula (20).
If we examine the assumptions in the above derivation, we see that
they are not quite identical with the equivalence assumption as
previously stated. The equivalence assumption specified that
$r_{ij} = r_{IJ} = r_{iJ}$ and $S_i = S_I$. Here we have made a somewhat
less rigorous assumption, namely that
$\overline{r_{ij} S_i S_j} = \overline{r_{IJ} S_I S_J} = \overline{r_{iJ} S_i S_J}$.
Thus we have assumed that our covariances are on the average equal.
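
As an illustration of formula (29), the following is a minimal sketch in Python, assuming the data are held as a list of persons, each a list of 0/1 item scores. The function name `kr20` and the six-person data set are hypothetical and are not taken from the text; variances are computed with divisor N so that each item variance equals $p_i q_i$.

```python
from statistics import pvariance


def kr20(scores):
    """Formula (29): r_tt = n/(n-1) * (S_t^2 - sum of S_i^2) / S_t^2.

    scores: list of persons, each a list of 0/1 item scores (assumed layout).
    """
    n_items = len(scores[0])
    n_persons = len(scores)
    totals = [sum(person) for person in scores]
    test_variance = pvariance(totals)  # S_t^2, divisor N

    # For a dichotomous item the variance S_i^2 is p_i * q_i.
    sum_item_variances = 0.0
    for i in range(n_items):
        p = sum(person[i] for person in scores) / n_persons
        sum_item_variances += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (
        test_variance - sum_item_variances
    ) / test_variance


# Hypothetical data: 6 persons, 4 items.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
reliability = kr20(data)
print(round(reliability, 3))
```

Only the test variance and the item difficulties enter the calculation; no inter-item correlation is computed, which is the practical attraction of the formula.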

The average inter-item covariance of a test is known exactly, and 
is of interest as a statistic descriptive of a test’s internal consistency. 
It may vary within the limits $-.25/(n-1)$ and $.25$. Similar observation
indicates that the average inter-item correlation can never be less
than $-1/(n-1)$. This observation applies not only to test items but
also to test batteries. Hence when the number of items is large it is 
mathematically almost impossible to obtain a negative average inter- 
item correlation. 
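
The lower limits quoted above can be seen from the fact that a variance cannot be negative. The following short derivation is a sketch under that single assumption; the symbols $\bar{r}$ (average inter-item correlation) and $\bar{c}$ (average inter-item covariance) are introduced here for convenience.

```latex
% Sum of n variates in standard measure:
n + n(n-1)\,\bar{r} \;\ge\; 0
\quad\Longrightarrow\quad
\bar{r} \;\ge\; -\frac{1}{n-1}

% Sum of n dichotomous items, with S_i^2 = p_i q_i \le .25:
\sum S_i^2 + n(n-1)\,\bar{c} \;\ge\; 0
\quad\Longrightarrow\quad
\bar{c} \;\ge\; -\frac{\sum S_i^2}{n(n-1)} \;\ge\; -\frac{.25}{n-1}
```

As $n$ increases both limits approach zero, which is why a negative average is almost impossible with a large number of items.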

If there is reason to believe that all test items are of equal difficulty,
then the term $\sum S_i^2$, which is of course $\sum p_i q_i$, will be equal
to $n\bar{p}\bar{q}$, and formula (29) may be written

$$r_{tt} = \frac{n}{n-1} \cdot \frac{S_t^2 - n\bar{p}\bar{q}}{S_t^2} \tag{30}$$

where

$$\bar{p} = \frac{\sum p_i}{n} \tag{31}$$


A number of formulae employing item-test correlation may be
derived on the basis of the observation that

$$\sum r_{it} S_i = S_t \tag{32}$$

Such formulae involve somewhat more arithmetical labour than formulae
already given, but represent no improvement in the estimation
of test reliability.
Jackson [52] has suggested that the accuracy of measurement of a
test should be described in terms of a sensitivity statistic $\gamma$, defined as

$$\gamma = \frac{\sqrt{\sigma_t^2 - \sigma_e^2}}{\sigma_e} \tag{33}$$

Hence the sample value of $\gamma$ is related to the reliability of a test by
the formula

$$\gamma = \sqrt{\frac{r_{tt}}{1 - r_{tt}}} \tag{34}$$

Substituting formula (29) in formula (34) we obtain a value of $\gamma$
estimated by the method of rational equivalence; thus

$$\gamma = \sqrt{\frac{n\left(S_t^2 - \sum S_i^2\right)}{n\sum S_i^2 - S_t^2}} \tag{35}$$
ізі 
The statistic $\gamma$ thus calculated may be used in the determination
of a probability of making an error of measurement by chance alone
greater than or equal to $\sqrt{S_t^2 - S_e^2}$ units.
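
The equal-difficulty formula (30) and the sensitivity statistic may be illustrated numerically. The sketch below assumes formula (34) is read as $\gamma = \sqrt{r_{tt}/(1-r_{tt})}$ and formula (35) as the same quantity written in item-analysis terms; the function names and the summary figures are hypothetical. It also uses the fact that the mean total score divided by $n$ equals the mean proportion passing.

```python
import math


def kr21(n_items, mean_score, test_variance):
    """Formula (30): formula (29) with every item variance set to p-bar * q-bar."""
    p_bar = mean_score / n_items  # formula (31): mean proportion passing
    q_bar = 1.0 - p_bar
    return (n_items / (n_items - 1)) * (
        test_variance - n_items * p_bar * q_bar
    ) / test_variance


def sensitivity_from_reliability(r_tt):
    """Formula (34): gamma = sqrt(r_tt / (1 - r_tt))."""
    return math.sqrt(r_tt / (1.0 - r_tt))


def sensitivity_from_items(n_items, test_variance, sum_item_variances):
    """Formula (35): gamma expressed through the test and item variances."""
    return math.sqrt(
        n_items * (test_variance - sum_item_variances)
        / (n_items * sum_item_variances - test_variance)
    )


# Hypothetical summary figures: 4 items, mean score 14/6,
# test variance 20/9, sum of item variances 5/6.
n, mean, var_t, sum_pq = 4, 14 / 6, 20 / 9, 5 / 6
r_29 = (n / (n - 1)) * (var_t - sum_pq) / var_t  # formula (29)
r_30 = kr21(n, mean, var_t)                      # formula (30)
gamma_a = sensitivity_from_reliability(r_29)
gamma_b = sensitivity_from_items(n, var_t, sum_pq)
print(round(r_29, 3), round(r_30, 3), round(gamma_a, 3), round(gamma_b, 3))
```

The two routes to $\gamma$ agree, as they must, and formula (30) gives a lower value than formula (29) here because the hypothetical items are not in fact of equal difficulty.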

Formula (29) above is identical with the Kuder and Richardson 
formula (20) which those authors found on the basis of empirical evi- 
dence to yield very satisfactory results. Their derivation, however, 
required the assumption that the matrix of inter-item correlations 
have a rank of 1, and that all the inter-item correlations be equal. 
These assumptions while sufficient are unnecessary, and indeed if they 
were necessary the formula would be of little practical value since few 
tests approximate to these conditions. 

As mentioned previously several of the formulae derived by Kuder
and Richardson are based on assumptions that are incompatible one
with the other. Their formula (14), for example,

$$r_{tt} = \frac{S_t^2 - \sum S_i^2}{\left(\sum S_i\right)^2 - \sum S_i^2} \cdot \frac{\left(\sum S_i\right)^2}{S_t^2}$$


is presumably derived on the assumption that the rank of the matrix
of inter-item correlations is 1, and that all the intercorrelations are
equal. The difficulty values of the items are allowed to vary over a
wide range. Now if the items are homogeneous with respect to difficulty
and content, then all the intercorrelations will be approximately
equal, and the inter-item correlation matrix will have a rank of 1. If,
however, the items are heterogeneous with respect to difficulty the
intercorrelations will not be equal, since the correlation between two
test items is not independent of their difficulties. In general the
greater the difference in difficulty between two test items the smaller
the correlation between them. Furthermore, if the items are heterogeneous
with respect to difficulty, although homogeneous with respect
to content, the matrix of correlations will not be of rank 1, since
differences in difficulty are represented in the factorial configuration
describing the matrix of inter-item correlations as additional factors.
Hence we see that the assumption of rank 1 and equal intercorrelation
is incompatible with the provision that the difficulty values of the
items be allowed to vary over a wide range.

The Kuder-Richardson formulae furnish useful statistics in describing
the properties of tests even though the equivalence assumption
is not satisfied. Under such circumstances, however, we are not
justified in describing the obtained coefficients as reliability
coefficients, since implicit in the reliability concept is the idea of repeated
measurement. If we are unwilling to make the equivalence assumption
we may refer to the coefficients obtained by the Kuder-Richardson
formulae as consistency coefficients; that is, coefficients descriptive of a
test's internal consistency. Reliability coefficients are, therefore,
identical with consistency coefficients when the equivalence condition
is satisfied.

CHAPTER VI


BATTERY RELIABILITY 
Battery Reliability 


The reliability of test batteries may be conveniently calculated 
from formulae for the correlation of sums. 


If $z_1, z_2, \ldots, z_n$ are $n$ initial tests and $z_I, z_{II}, \ldots, z_N$
are corresponding alternative forms, then the correlation between the
simple sum of scores on the initial tests and the sum of scores on the
alternative forms may be written

$$R = \frac{\sum r_{iI} S_i S_I + 2\sum_{i<j} r_{iJ} S_i S_J}{\sqrt{\left[\sum S_i^2 + 2\sum_{i<j} r_{ij} S_i S_j\right]\left[\sum S_I^2 + 2\sum_{I<J} r_{IJ} S_I S_J\right]}} \tag{37}$$

where $r_{iI}$ = reliability of test $z_i$,
$S_i$ = standard deviation of test $z_i$,
$S_I$ = standard deviation of test $z_I$.
If we can assume that

$$S_i = S_I \tag{38}$$

and

$$r_{ij} = r_{IJ} = r_{iJ} \tag{39}$$

then the reliability of the battery may be written

$$R = \frac{\sum r_{iI} S_i^2 + 2\sum_{i<j} r_{ij} S_i S_j}{\sum S_i^2 + 2\sum_{i<j} r_{ij} S_i S_j} \tag{40}$$

Formulae (37) and (40) relate to the reliability of the simple sum
of scores on tests. If the scores on each test are expressed in standard
measure formula (37) may be written as

$$R = \frac{\sum r_{iI} + 2\sum_{i<j} r_{iJ}}{\sqrt{\left[n + 2\sum_{i<j} r_{ij}\right]\left[n + 2\sum_{I<J} r_{IJ}\right]}} \tag{41}$$

and formula (40) as

$$R = \frac{\sum r_{iI} + 2\sum_{i<j} r_{ij}}{n + 2\sum_{i<j} r_{ij}} \tag{42}$$

Formulae (41) and (42) give the reliability of the sum of standard
scores. If the tests in our battery are weighted according to any
system of weights, $w_1, w_2, \ldots, w_n$, then the reliability of the sum of
weighted scores is given by

$$R = \frac{\sum r_{iI} w_i^2 + 2\sum_{i<j} r_{iJ} w_i w_j}{\sqrt{\left[\sum w_i^2 + 2\sum_{i<j} r_{ij} w_i w_j\right]\left[\sum w_i^2 + 2\sum_{I<J} r_{IJ} w_i w_j\right]}} \tag{43}$$

If we make the equivalence assumption of equation (39), formula (43)
may be written as

$$R = \frac{\sum r_{iI} w_i^2 + 2\sum_{i<j} r_{ij} w_i w_j}{\sum w_i^2 + 2\sum_{i<j} r_{ij} w_i w_j} \tag{44}$$

Formulae (40), (42) and (44) may be written in such a form that the
battery reliability may be obtained without calculating all the
correlations between tests. We require, however, the variance of the
sum of scores, the test variances, and the test reliabilities. The
variance of the sum of raw scores may be written

$$S_R^2 = \sum S_i^2 + 2\sum_{i<j} r_{ij} S_i S_j \tag{45}$$

where $S_R^2$ is the variance of the sum of raw scores. Hence formula
(40) may be written

$$R = \frac{S_R^2 - \sum S_i^2 + \sum r_{iI} S_i^2}{S_R^2} \tag{46}$$

The variance of the sum of standard scores is given by

$$S_S^2 = n + 2\sum_{i<j} r_{ij} \tag{47}$$

Hence formula (42) becomes

$$R = \frac{S_S^2 - n + \sum r_{iI}}{S_S^2} \tag{48}$$

The variance of the sum of weighted scores may be written

$$S_W^2 = \sum w_i^2 + 2\sum_{i<j} r_{ij} w_i w_j \tag{49}$$

We may, therefore, write formula (44) as follows:

$$R = \frac{S_W^2 - \sum w_i^2 + \sum r_{iI} w_i^2}{S_W^2} \tag{50}$$

The error variance of the sum of raw scores, assuming equations
(38) and (39), is given by

$$S_{eR}^2 = \sum S_i^2 - \sum r_{iI} S_i^2 \tag{51}$$

The error variance of the sum of standard scores is given by

$$S_{eS}^2 = n - \sum r_{iI} \tag{52}$$

and the error variance of scores weighted by any given system of
weights by

$$S_{eW}^2 = \sum w_i^2 - \sum r_{iI} w_i^2 \tag{53}$$

where $S_{eR}^2$, $S_{eS}^2$, and $S_{eW}^2$ denote the error variances of raw,
standard, and weighted scores respectively.

For computational purposes in applying the formulae given above 
it is most convenient to write all the intercorrelations in the form of a 
pooling square as described by Thomson [111, 112], and described 
briefly elsewhere in this Bulletin (see p. 72).


The Split-half Reliability of a Test Battery 


The split-half reliability of a test battery may be calculated by 
computing the correlation between the odd and even items of each 
test, boosting this coefficient by the Spearman-Brown formula, and, 
when raw scores are used, applying formula (40). The assumption is 
made that the variances of the odd and even items of any given test 
do not differ significantly. This assumption is, of course, implicit in 
the Spearman-Brown formula. If the variances of the halves of the 
same test differ somewhat the following formula may be used: 

$$R = \frac{S_R^2 - \sum^{2n} S_{\frac12}^2 + \sum^{2n} r_{\frac12\frac12} S_{\frac12}^2}{S_R^2} \tag{54}$$

In the above formula $S_R^2$ is obtained either by applying formula (45)
or by straightforward calculation. The term $\sum^{2n} S_{\frac12}^2$ is the
sum of the variances of the half tests. The summation is, of course,
over $2n$ values. The term $\sum^{2n} r_{\frac12\frac12} S_{\frac12}^2$ is the sum
of the self covariances of test halves. There are $n$ values of
$r_{\frac12\frac12}$ and $2n$ values of $S_{\frac12}^2$, hence each value of
$r_{\frac12\frac12}$ appears twice in the summation.
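
As a check on the split-half formula, consider a battery consisting of a single test whose two halves have equal variance. The working below is a sketch that assumes formula (54) is read as $R = (S_R^2 - \sum S_{\frac12}^2 + \sum r_{\frac12\frac12} S_{\frac12}^2)/S_R^2$, with $S^2$ written for the common half-test variance and $r$ for the correlation between the halves.

```latex
S_R^2 = 2S^2(1 + r)

R = \frac{2S^2(1+r) - 2S^2 + 2rS^2}{2S^2(1+r)}
  = \frac{4rS^2}{2S^2(1+r)}
  = \frac{2r}{1+r}
```

Under these conditions the formula reduces to the Spearman-Brown formula for a test of double length, which is the result expected when the half variances do not differ.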


Maximum Battery Reliability 


Regression weights may be assigned to any given battery of tests
to obtain maximum prediction of an external criterion. That is, given
a dependent variate $z_0$ which is presumed to measure a specified
attribute, and $n$ independent variates, $z_1, z_2, \ldots, z_n$, which measure
characteristics of that attribute, weights obtained by the method of
least squares may be assigned to the independent variates which will
maximize the correlation between the scores on the dependent variate
$z_0$ and the sum of weighted scores on the independent variates. When,
however, there are a number of dependent variates, weights may be
assigned to both dependent and independent variates to maximize the
correlation between the sum of weighted scores on the criteria and the
sum of weighted scores on the predictors. Hotelling [49] furnished a
least square solution to this problem.

Thomson [111] applied Hotelling's solution to the special case of
determining weights that would yield maximum battery reliability.
That is, if $x_1, x_2, \ldots, x_n$ are scores in standard measure obtained on
the first application of a test and $x_I, x_{II}, \ldots, x_N$ are corresponding
scores obtained by giving the test a second time or by the
administration of parallel forms, we may write two linear functions

$$L_1 = k_1 x_1 + k_2 x_2 + \ldots + k_n x_n$$
$$L_I = k_I x_I + k_{II} x_{II} + \ldots + k_N x_N$$

and obtain a series of weights $k_1, k_2, \ldots, k_n$ and $k_I, k_{II}, \ldots, k_N$, such
that the correlation between $L_1$ and $L_I$ is a maximum. If, however,
we make the equivalence assumption, that is, if $r_{ij} = r_{IJ} = r_{iJ}$, and
$k_1 = k_I$, $k_2 = k_{II}, \ldots, k_n = k_N$, then the solution of the required weights
is in some degree simplified.

None the less the attainment of weights that will maximize the
reliability of a test battery is a matter of much arithmetical labour.
If we write, for example, the intercorrelations between two applications
of three tests in the form of a matrix, thus,

$$\begin{array}{c|ccc|ccc}
 & z_1 & z_2 & z_3 & z_I & z_{II} & z_{III} \\ \hline
z_1 & 1 & r_{12} & r_{13} & r_{1I} & r_{12} & r_{13} \\
z_2 & r_{12} & 1 & r_{23} & r_{12} & r_{2II} & r_{23} \\
z_3 & r_{13} & r_{23} & 1 & r_{13} & r_{23} & r_{3III} \\ \hline
z_I & r_{1I} & r_{12} & r_{13} & 1 & r_{12} & r_{13} \\
z_{II} & r_{12} & r_{2II} & r_{23} & r_{12} & 1 & r_{23} \\
z_{III} & r_{13} & r_{23} & r_{3III} & r_{13} & r_{23} & 1
\end{array}$$

and denote the four quadrants by

$$\begin{array}{cc} A & C \\ C & A \end{array}$$

then to attain maximum reliability we must solve the equation

$$|CA^{-1}C - \lambda A| = 0 \tag{55}$$

where $\lambda_1$ is the largest latent root. Then

$$R = \sqrt{\lambda_1} \tag{56}$$

where $R$ is the maximum reliability coefficient. The computation of
$\lambda_1$ with three tests involves the solution of a cubic equation and with $n$
tests the solution of an equation of the $n$th degree. The weights
$k_1, k_2, \ldots, k_n$ which will yield maximum battery reliability are
proportional to the elements in any row of the adjugate of the matrix
$(CA^{-1}C - \lambda_1 A)$. The above exact solution is a matter of much
difficulty when more than three or four tests are included in the
calculation.

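We may consider, however, a special case of the above general
problem. When all tests in a battery are presumed to measure the
same attribute, let us say $g$, and differ only with respect to the accuracy
with which this attribute $g$ is measured, then the best estimate of a
person's $g$ is given by the linear function

$$\hat{g} = k_1 x_1 + k_2 x_2 + \ldots + k_n x_n$$

where $k_1, k_2, \ldots, k_n$ are regression weights, and $x_1, x_2, \ldots, x_n$ are in
standard measure.

But

$$k_i = -\frac{\Delta_{0i}}{\Delta_{00}} \tag{57}$$

where

$$\Delta = \begin{vmatrix}
1 & r_{1g} & r_{2g} & \cdots & r_{ng} \\
r_{1g} & 1 & r_{12} & \cdots & r_{1n} \\
r_{2g} & r_{12} & 1 & \cdots & r_{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
r_{ng} & r_{1n} & r_{2n} & \cdots & 1
\end{vmatrix}$$

and

$$r_{ij} = r_{ig} r_{jg} \tag{58}$$

Evaluating $\Delta_{0i}$ (see [61] pp. 212-213) in terms of $r_{ig}$, we obtain

$$\Delta_{0i} = -r_{ig}\,\frac{(1 - r_{1g}^2)(1 - r_{2g}^2) \cdots (1 - r_{ng}^2)}{1 - r_{ig}^2} \tag{59}$$

But the quantity

$$\frac{(1 - r_{1g}^2)(1 - r_{2g}^2) \cdots (1 - r_{ng}^2)}{\Delta_{00}} = u$$

where $u$ is constant for all variates. Hence the relative weights to be
assigned to each test to give maximum prediction of $g$ are given by

$$k_i = \frac{r_{ig}}{1 - r_{ig}^2} \tag{60}$$
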

But if all our tests are presumed to be a measure of the same attribute,
and differ only in the accuracy with which that attribute is measured,
that is, if the matrix of intercorrelations may be explained in terms of
one general factor and $n$ error specifics, we may write

$$r_{ig} = \sqrt{r_{iI}} \tag{61}$$

hence

$$k_i = \frac{\sqrt{r_{iI}}}{1 - r_{iI}} \tag{62}$$

If we are reasonably satisfied that no specific other than error
specific exists in the factorial configuration describing the matrix of
intercorrelations, that is, if the equality

$$\frac{r_{ig}^2}{r_{iI}} = 1 \tag{63}$$

is satisfied, we may assign to each of our variates the weights calculated
by formula (62) and obtain a best estimate of a man's true score, which
under the conditions specified is identical with a best estimate of a
man's $g$. Furthermore these weights are directly proportional to the
elements of any row of the adjugate of the matrix $(CA^{-1}C - \lambda_1 A)$, and
yield, therefore, maximum battery reliability.

Consider a numerical example. Let the intercorrelations between
three tests be as follows:

            z_1     z_2     z_3
    z_1 |   1       .72     .63
    z_2 |   .72     1       .56
    z_3 |   .63     .56     1

and let the reliabilities of the three tests be .81, .64, and .49
respectively. Here the weights computed by formulae (60) and (62) are
identical, and are in the ratio

    1 : .469 : .290

The weighted battery has a reliability of .875. The weights obtained
by solving the equation

$$|CA^{-1}C - \lambda A| = 0$$

for $\lambda_1$ are in corresponding ratio, the value of $\lambda_1$ being .7657, and the
maximum reliability .875.
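
The figures of this example may be checked numerically. The sketch below is not the adjugate method described in the text: it swaps in power iteration on $A^{-1}C$, whose largest root $\mu$ satisfies $\mu^2 = \lambda_1$ when the quadrants are $A$ and $C$ as defined above. The helper names `solve`, `matvec`, and `max_reliability` are hypothetical, and the routine assumes the largest root is also the largest in absolute value.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def matvec(a, v):
    """Product of a square matrix and a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def max_reliability(a, c, iterations=200):
    """Largest root mu of C k = mu A k, with weights scaled so that k[0] = 1."""
    k = [1.0] * len(a)
    for _ in range(iterations):
        k = solve(a, matvec(c, k))
        k = [x / k[0] for x in k]
    numerator = sum(x * y for x, y in zip(k, matvec(c, k)))
    denominator = sum(x * y for x, y in zip(k, matvec(a, k)))
    return numerator / denominator, k


# Quadrant A: intercorrelations; quadrant C: the same with the
# reliabilities .81, .64, .49 in the principal diagonal.
A = [[1.0, 0.72, 0.63], [0.72, 1.0, 0.56], [0.63, 0.56, 1.0]]
C = [[0.81, 0.72, 0.63], [0.72, 0.64, 0.56], [0.63, 0.56, 0.49]]

R, weights = max_reliability(A, C)
latent_root = R ** 2
print(round(R, 3), round(latent_root, 4), [round(w, 3) for w in weights])
```

For these data the iteration settles at once, since the quadrant $C$ is of rank 1, and the weights agree with those given by formula (62).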
If the condition imposed by equation (63) is not satisfied the
three methods yield three different sets of weights, and different
battery reliabilities are obtained. Weighting scores by formulae (60) or
(62) will usually, although not always, yield a battery reliability
greater than the reliability obtained by taking the straight sum of
raw or standard scores. In general, we may state that as
$r_{ij} \to \sqrt{r_{iI} r_{jJ}}$ or as $r_{ig}^2 \to r_{iI}$ the battery reliability
obtained by weighting according to either formula (60) or formula (62)
tends towards a maximum.

In the numerical example given by Thomson [111] the table of
intercorrelations between three tests was

            z_1     z_2     z_3
    z_1 |   1       .482    .617
    z_2 |   .482    1       .397
    z_3 |   .617    .397    1

the reliabilities of the three tests being .86, .73, and .83 respectively.
The weights required to obtain a maximum reliability of .915 are in
the ratio

    1 : .36 : .76

The weights obtained by the formula (62) are in the ratio

    1 : .478 : .809

and the obtained battery reliability is .914. The weights that yield
a best estimate of $g$ are of the order

    1 : .234 : .420

and the battery reliability resulting from the use of these weights is
.910. When equal weights are assigned to each variate the battery
reliability is .903. In the above numerical example weighting the
variates to obtain maximum reliability does not increase the accuracy
of measurement in any great degree.
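
The four battery reliabilities quoted for this example can be reproduced from formula (44), read as the ratio of $\sum r_{iI} w_i^2 + 2\sum r_{ij} w_i w_j$ to $\sum w_i^2 + 2\sum r_{ij} w_i w_j$ for scores in standard measure. The following is a minimal sketch under that reading; the function name `battery_reliability` is hypothetical.

```python
from math import sqrt


def battery_reliability(corr, reliabilities, weights):
    """Reliability of a weighted sum of standard scores, formula (44)."""
    n = len(reliabilities)
    numerator = sum(reliabilities[i] * weights[i] ** 2 for i in range(n))
    denominator = sum(weights[i] ** 2 for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            cross = 2 * corr[i][j] * weights[i] * weights[j]
            numerator += cross
            denominator += cross
    return numerator / denominator


corr = [[1.0, 0.482, 0.617], [0.482, 1.0, 0.397], [0.617, 0.397, 1.0]]
rel = [0.86, 0.73, 0.83]

equal = battery_reliability(corr, rel, [1, 1, 1])
by_formula_62 = battery_reliability(corr, rel, [sqrt(r) / (1 - r) for r in rel])
by_max_weights = battery_reliability(corr, rel, [1, 0.36, 0.76])
by_g_weights = battery_reliability(corr, rel, [1, 0.234, 0.420])

for value in (equal, by_formula_62, by_max_weights, by_g_weights):
    print(round(value, 3))
```

The four values differ only in the second decimal place, which is the point made in the text: for this battery the choice of weights has little effect on the accuracy of measurement.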


Reliability and Factor Patterns 


The reliability of a test may be written as a function of the factorial
structure of the tests included in the battery. Consider firstly the
case where the matrix of intercorrelations is of rank 1. If the scores
on the $n$ tests included in the battery are in standard measure the
battery reliability, $R$, is given by the formula

$$R = \frac{\sum r_{iI} + 2\sum_{i<j} r_{ij}}{n + 2\sum_{i<j} r_{ij}} \tag{64}$$

where $R$ = the reliability of the sum of standard scores,
$r_{iI}$ = the reliability of test $z_i$,
$r_{ij}$ = correlation between test $i$ and test $j$.
If the matrix is of rank 1 the tests in the battery may be described
in terms of one general factor, $n$ specifics, and $n$ error specifics.
Writing

$a_i$ = loading of the $i$th test in the general factor,
$e_i^2$ = error variance of the $i$th test,
$s_i^2$ = specific variance of the $i$th test,

then (64) may be written

$$R = \frac{\sum_{i=1}^{n} (a_i^2 + s_i^2) + \sum_{i \neq j} a_i a_j}{n + \sum_{i \neq j} a_i a_j} \tag{65}$$

which may be put in the form

$$R = \frac{n - \sum_{i=1}^{n} e_i^2 + \sum_{i \neq j} \sqrt{(1 - s_i^2 - e_i^2)(1 - s_j^2 - e_j^2)}}{n + \sum_{i \neq j} \sqrt{(1 - s_i^2 - e_i^2)(1 - s_j^2 - e_j^2)}} \tag{66}$$

Formula (66) indicates, when the matrix of intercorrelations is of
rank 1, the dependency of the battery reliability on the error and
specific factor variances.

Consider the case when the rank of the matrix is greater than 1.
Let $\alpha_i$ be the loading of test $i$ in the first factor common to $a$ tests,
$\beta_i$ the loading of test $i$ in the second factor common to $b$ tests, and so
on. Hence

$$R = \frac{n - \sum e_i^2 + \sum_{i \neq j}^{a} \alpha_i \alpha_j + \sum_{i \neq j}^{b} \beta_i \beta_j + \ldots + \sum_{i \neq j}^{r} \rho_i \rho_j}{n + \sum_{i \neq j}^{a} \alpha_i \alpha_j + \sum_{i \neq j}^{b} \beta_i \beta_j + \ldots + \sum_{i \neq j}^{r} \rho_i \rho_j} \tag{67}$$

This formula shows the relationship between the reliability of a test
battery and the factorial composition of the tests included in it.
Obviously from the point of view of reliability no special advantage 
need be attached to matrices of correlations of rank 1. The reliability 
of the battery depends in part on the magnitude of the intercorrelations,
and although the rank of the matrix is 1, these intercorrelations
may be small. We may observe, however, that as a battery of tests
approximates to the measurement of a unit trait the rank of the
matrix of intercorrelations tends to unity. The converse does not
hold.


Combinatorial Reliability Analysis 


[...] a new sub-test may not infrequently [...] sub-tests of all
possible [...]. Thus the $n$ sub-tests [...] in all combinations to yield
$2^n - 1$ possible reliability coefficients. Among these $2^n - 1$ tests and
teams of tests may be found a test or team of $k$ tests of reliability
$r_{kK}$, such that $r_{kK} \geq r_{iI}$ and $r_{kK} \geq r_{nN}$, where $k \geq 1$
and $k \leq n$. [...] calculation of the reliability of all possible
combinations [...] and is termed combinatorial reliability analysis.
The process of combinatorial reliability analysis is analogous to
the technique of complete tilling described by Ragnar Frisch [37] in
connection with multiple regression problems. Frisch points out that
in enquiries involving multiple regression the prediction attained by
the weighted sum of $n$ independent variates is frequently not
significantly greater than the prediction attained by the weighted sum of
a much smaller number of variates. Since the addition or deletion
of independent variates can alter substantially the relative magnitude
of the weights to be assigned to the remaining variates, it is necessary
to calculate regression weights for all possible combinations of the
independent variates before our information regarding the available
data is complete. In multiple regression the addition of new variates
can never reduce the multiple correlation coefficient. If, however, the
scores are added together without the use of weights, the correlation
of the team of independent variates with the criterion may be reduced
by the addition of a new variate to the team. Similarly in test
reliability the addition of new tests to a battery can never reduce the
reliability of the battery weighted to yield maximum reliability. If,
however, the scores are added together without the use of weights, as
is the common practice, the addition of new tests may reduce
significantly the reliability of the battery.


Figure 3.—Geometrical Representation of Battery Reliability. 


The reliability of a test consisting of a number of sub-tests may be 
visualized geometrically as follows. The correlation between two
sub-tests may be represented as the cosine of the angle between two
vectors, and the intercorrelations between $n$ sub-tests by the cosines of
the $n(n-1)/2$ angles between a sheaf of $n$ vectors. If our test has been
given a second time to the same group of persons, we have two sheaves 
of vectors as shown in Figure 3 where the unbroken vectors represent 
the first and the broken vectors the second administration of the group 
of sub-tests. In Figure 3, for purposes of illustration, our test is
presumed to consist of four sub-tests. It will be remembered that there
are as many dimensions as there are vectors. The reliability coefficient
is the cosine of the angle between the two centroid vectors. By
deleting one pair of corresponding vectors from our diagram the two
centroid vectors are pulled farther apart and the test reliability
reduced. By removing another pair of corresponding vectors from our
diagram the centroid vectors are drawn closer together, and the
reliability of
the test increased. The problem visualized geometrically is, therefore, 
to determine the particular combination of sub-tests that will bring 
the two centroid vectors as close together as possible. 

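The reliabilities of all possible combinations of sub-tests may be
readily calculated from the matrix of covariances. The covariances
obtained by two successive applications of $n$ sub-tests to the same
sample of persons may be written in the form of a pooling square, as
follows:

$$\begin{array}{c|cccc|cccc}
 & z_1 & z_2 & \cdots & z_n & z_I & z_{II} & \cdots & z_N \\ \hline
z_1 & S_1^2 & r_{12}S_1S_2 & \cdots & r_{1n}S_1S_n & r_{1I}S_1S_I & r_{1II}S_1S_{II} & \cdots & r_{1N}S_1S_N \\
z_2 & r_{12}S_1S_2 & S_2^2 & \cdots & r_{2n}S_2S_n & r_{2I}S_2S_I & r_{2II}S_2S_{II} & \cdots & r_{2N}S_2S_N \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
z_n & r_{1n}S_1S_n & r_{2n}S_2S_n & \cdots & S_n^2 & r_{nI}S_nS_I & r_{nII}S_nS_{II} & \cdots & r_{nN}S_nS_N \\ \hline
z_I & r_{1I}S_1S_I & r_{2I}S_2S_I & \cdots & r_{nI}S_nS_I & S_I^2 & r_{I\,II}S_IS_{II} & \cdots & r_{IN}S_IS_N \\
z_{II} & r_{1II}S_1S_{II} & r_{2II}S_2S_{II} & \cdots & r_{nII}S_nS_{II} & r_{I\,II}S_IS_{II} & S_{II}^2 & \cdots & r_{II\,N}S_{II}S_N \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
z_N & r_{1N}S_1S_N & r_{2N}S_2S_N & \cdots & r_{nN}S_nS_N & r_{IN}S_IS_N & r_{II\,N}S_{II}S_N & \cdots & S_N^2
\end{array}$$

The correlation between $z_{(1+2+\ldots+n)}$ and $z_{(I+II+\ldots+N)}$ is given by
dividing the sum of all elements in the North-East quadrant of the above
square by the square root of the product of the sum of the elements in
the North-West and South-East quadrants, as follows:

$$r_{(1+2+\ldots+n)(I+II+\ldots+N)} = \frac{\sum r_{iI} S_i S_I + 2\sum_{i<j} r_{iJ} S_i S_J}{\sqrt{\left[\sum S_i^2 + 2\sum_{i<j} r_{ij} S_i S_j\right]\left[\sum S_I^2 + 2\sum_{I<J} r_{IJ} S_I S_J\right]}} \tag{68}$$

The term in the numerator of the above equation is the covariance;
the two terms in the denominator are the variances of the sums of raw
scores.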

The reliability of all possible combinations of sub-tests may be
obtained by calculating values of
$\sum r_{iI} S_i S_I + 2\sum_{i<j} r_{iJ} S_i S_J$,
$\sum S_i^2 + 2\sum_{i<j} r_{ij} S_i S_j$, and
$\sum S_I^2 + 2\sum_{I<J} r_{IJ} S_I S_J$ for all combinations of sub-tests, and
substituting the values obtained in formula (68).
The method outlined above is illustrated with reference to the
following set of data. The Revised Beta Examination was administered
twice to three classes of Grade IX pupils ($N = 95$). This test is
a revision of the United States Army Group Examination Beta. It
consists of six sub-tests, described as follows:

    Sub-test   Content                   Time (min.)   Maximum Score
    1          Maze                      1½            10
    2          Digit Symbol              2             30
    3          Picture Discrimination    3             20
    4          Form Board                4             18
    5          Picture Completion        2½            20
    6          Number Checking           2             25


All the 66 different intercorrelations between the scores on two
successive applications of the six sub-tests were calculated. These
intercorrelations, which are given in Table XXII, are observed to be
low, a few indeed being negative, indicating low internal consistency.
The matrix of covariances with variances in the principal diagonal is
given in Table XXIII. From this covariance matrix all statistics
necessary to carry out a complete combinatorial analysis may be
computed. The variances, covariances, and reliabilities of all possible
combinations of sub-tests are given in columns 2, 3, 4, and 5 of Tables
XXIVa, XXIVb and XXIVc.

To illustrate how these figures are calculated consider the
combination of sub-tests 124. The variance of the scores on the first
application of these three sub-tests is obtained by adding together the
elements in the N.W. quadrant of the covariance matrix of Table
XXIII after deleting rows and columns 3, 5, and 6. This variance is
found to be 26.227. The variance of the scores on the second
application is calculated in similar manner from the S.E. quadrant, and is
21.685. The covariance 18.385 is obtained in similar manner from
the N.E. quadrant. Hence the reliability of the 124 combination,
.771, is calculated by dividing the covariance thus obtained by the
square root of the product of the two variances.
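
The procedure just described can be sketched in a few lines of Python. The sketch assumes the pooling square is held as a $2n \times 2n$ list of lists, with rows and columns $0$ to $n-1$ for the first application and $n$ to $2n-1$ for the second. The function names and the small two-sub-test matrix are hypothetical and are not the data of Table XXIII; only the three sums for combination 124 are taken from the text.

```python
from itertools import combinations
from math import sqrt


def combination_reliability(cov, subset):
    """Formula (68) for the unweighted sum of the sub-tests in `subset`.

    cov    : 2n x 2n covariance matrix (pooling square).
    subset : tuple of 0-based sub-test indices to retain.
    """
    n = len(cov) // 2
    first = list(subset)
    second = [n + i for i in subset]
    north_west = sum(cov[a][b] for a in first for b in first)    # variance, 1st application
    south_east = sum(cov[a][b] for a in second for b in second)  # variance, 2nd application
    north_east = sum(cov[a][b] for a in first for b in second)   # covariance
    return north_east / sqrt(north_west * south_east)


def all_combinations(cov):
    """Reliabilities of all 2^n - 1 non-empty combinations of sub-tests."""
    n = len(cov) // 2
    return {
        subset: combination_reliability(cov, subset)
        for size in range(1, n + 1)
        for subset in combinations(range(n), size)
    }


# Hypothetical pooling square for two sub-tests.
cov = [
    [4.0, 3.0, 3.2, 2.5],
    [3.0, 9.0, 2.5, 7.2],
    [3.2, 2.5, 4.0, 3.0],
    [2.5, 7.2, 3.0, 9.0],
]
results = all_combinations(cov)

# Combination 124, from the three sums quoted in the text.
r_124 = 18.385 / sqrt(26.227 * 21.685)
print(round(r_124, 3))
```

Deleting rows and columns, as the text describes, is equivalent here to summing only over the retained indices in each quadrant.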
TABLE XXIVa

DATA OBTAINED FROM COMBINATORIAL RELIABILITY ANALYSIS,
REVISED BETA EXAMINATION

    Combination
    of Variables   $S_1^2$    $S_I^2$    $r_{1I}S_1S_I$   $r_{1I}$
    1               1.824      1.243        .693          .585
    2              12.568      9.788       8.105          .731
    3               6.124      5.604       3.806          .650
    4               9.611      9.368       7.547          .795
    5               5.669      4.514       4.161          .823
    6               5.652      5.318       3.694          .674
    12             16.763     11.711      10.658          .761
    13              9.609      7.404       5.255          .623
    14             12.050     12.142       9.240          .764
    15              8.907      7.206       6.297          .786
    16              7.981      7.537       5.007          .645
    23             18.526     13.584      10.245          .646
    24             21.416     18.231      14.831          .751
    25             19.647     13.947      13.653          .825
    26             24.714     22.794      19.185          .808
    34             19.403     19.185      15.681          .813
    35             17.100     13.543      10.759          .707
    36             12.194     12.014       8.035          .664
    45             17.121     17.175      13.928          .812
    46             16.068     16.267      11.884          .735
    56             12.302     11.082       8.495          .728

TABLE XXIVb

DATA OBTAINED FROM COMBINATORIAL RELIABILITY ANALYSIS,
REVISED BETA EXAMINATION

    Combination
    of Variables   $S_1^2$    $S_I^2$    $r_{1I}S_1S_I$   $r_{1I}$
    123            24.382     16.064      13.555          .685
    124            26.227     21.685      18.385          .771
    134            23.504     22.516      18.131          .788
    234            31.042     26.241      21.301          .746
    125            25.256     17.318      17.650          .879
    135            22.000     16.791      13.651          .710
    235            30.912     21.167      18.586          .727
    145            20.974     21.398      17.059          .805
    245            30.336     25.688      22.595          .810
    345            32.220     30.416      24.849          .794
    126            29.414     25.693      22.359          .813
    136            16.184     14.790      10.105          .653
    236            31.089     27.682      21.863          .745
    146            19.013     20.017      14.197          .728
    246            34.367     32.818      26.556          .791
    346            26.278     27.177      20.555          .769
    156            16.045     14.749      11.250          .731
    256            32.774     28.202      25.375          .835
    356            24.151     21.202      15.629          .691
    456            24.559     25.325      18.900          .758
TABLE XXIVc

DATA OBTAINED FROM COMBINATORIAL RELIABILITY ANALYSIS,
REVISED BETA EXAMINATION

Combination    Variance,    Variance,    Covariance    Reliability
of Variables   1st appl.    2nd appl.
   1234         37.514       30.252       25.611
   1235         38.183       25.096       23.339
   1245         36.561       30.583       27.591
   1345         37.736       35.197       28.742
   2345         45.269       37.114       31.856
   1236         37.450       31.138       25.792
   1246         39.682       37.249       30.730
   1346         30.884       31.484       23.624
   2346         44.408       41.920       33.562
   1256         32.395       32.550       29.991          .924
   1356         29.556       25.426       19.141          .698
   2356         44.456       35.264       30.844
   1456         34.170       29.808       22.656
   2456         44.268       41.520       34.960
   3456         40.076       39.657       30.363
  12345         53.156       42.577       37.609
  12346         51.387       46.907       38.492
  12356         52.232       41.419       36.216
  12456         50.998       47.396       40.576
  13456         46.097       45.413       34.875
  23456         59.618       54.045       44.758
 123456         68.010       60.481       51.130          .797


Examination of the reliability coefficients given in Tables XXIVa,
XXIVb, and XXIVc indicates that for heterogeneous material the
addition of one or more sub-tests to an existing sub-test or team of
sub-tests may increase, decrease, or leave unaltered the reliability.
Consider the following sequence:


Sub-tests    Reliability
    5           .823
   25           .825
  125           .879
 1256           .924


Sub-test 5 is the most reliable of all single sub-tests, and is indeed
more reliable than the whole test. The reliability of the whole test is
.797. Thus we have a situation where a test requiring only 2½ minutes
working time, and containing 20 items, is more reliable than a
combination of 6 sub-tests requiring altogether 15 minutes working time, and
containing 123 items. Thus the efficacy of measurement is impaired
by some characteristic of the interaction of the tests in combination.


Consider the following sequence: 


Sub-tests    Reliability
    6           .674
   56           .728
  156           .731
 1356           .698


The reliability of the sequence 1356 is .698 as compared with .924 for 
the sequence 1256. Thus the addition of test 3 to the 156 combination 
reduces the reliability coefficient from .731 to .698, while the addition 
of test 2 to the 156 combination increases the reliability to .924. With 
heterogeneous material of the type contained in this test no very sub- 
stantial general tendency exists for the reliability to increase with 
increase in the number of sub-tests. Indeed the addition of new sub- 
tests is found in many cases to result in reduced reliability. 
The test material used in obtaining the above illustrative data was
pictorial in type, and the various sub-tests differed widely in content 
with the result that the intercorrelations between sub-tests were low. 
When, however, the intercorrelations between sub-tests are high we 
should expect to find in general a marked increase in the reliability 
coefficients obtained by combinatorial analysis with increase in the 
number of sub-tests. 
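
A rough numerical sketch may make the contrast concrete. It rests on a deliberately simplified model which is not part of the analysis reported here: k sub-tests of equal variance, each with the same reliability r, and every pair of different sub-tests (within one application or across the two) correlating rho. Under that assumption the reliability of the unweighted sum is [kr + k(k-1)rho] / [k + k(k-1)rho]. The function name is hypothetical.

```python
def composite_reliability(k, r, rho):
    """Reliability of the sum of k sub-tests under the simplified model.

    k   -- number of sub-tests
    r   -- reliability of each single sub-test
    rho -- correlation between any two different sub-tests
    """
    cross = k * (k - 1) * rho          # contribution of the off-diagonal terms
    return (k * r + cross) / (k + cross)

# Reliability of 1 to 5 sub-tests for low, moderate and high intercorrelation.
for rho in (0.0, 0.2, 0.6):
    row = [round(composite_reliability(k, 0.7, rho), 3) for k in range(1, 6)]
    print(rho, row)
```

With rho equal to zero the composite is no more reliable than a single sub-test, whereas with rho equal to .6 five sub-tests of reliability .7 give a composite reliability of about .91.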

To illustrate this latter situation forms A and B of the Junior
Dominion Group Test of Intelligence were administered to 107 children
in Grades IV and V. This test is verbal in type and consists of 5
sub-tests.


                                     No. of     Time
Sub-test   Content                   items      (min.)
   1       Opposites                   17         3
   2       Classification              17         3
   3       Analogies                   17         3
   4       Arithmetic Reasoning        15         7
   5       Following Directions        15         9


The items of forms A and B of this test were paired for difficulty
and discriminatory power. The two forms are regarded as attaining
a high degree of equivalence.

All 45 intercorrelations between the scores obtained on the five
sub-tests of these two forms were obtained. These intercorrelations,
which are given in Table XXV, are seen to be roughly of the order .5
and .6. Table XXVI gives the matrix of covariances. From this
matrix, as previously described, all statistics necessary to compute the
reliability of all possible combinations of sub-tests may be readily
obtained. The variances, covariances, and reliabilities of all sub-tests
and possible combinations of sub-tests are given in columns 2, 3, 4, and
5, respectively, of Tables XXVIIa and XXVIIb.

Here we find a strong tendency for the reliability coefficients to
increase with increase in the number of sub-tests. The highest
coefficient obtained is for the 1235 combination, .9479, a coefficient which
is slightly higher than the coefficient .9211 obtained for the whole test.
Apparently the addition of sub-test 4, arithmetical reasoning, has
a slightly negative effect.
TABLE XXVIIa

DATA OBTAINED FROM COMBINATORIAL RELIABILITY ANALYSIS,
JUNIOR DOMINION GROUP TEST OF INTELLIGENCE

Combination    Variance,    Variance,    Covariance    Reliability
of Variables   1st appl.    2nd appl.
      1          8.174        9.204        5.766          .665
      2         12.036       12.669        9.517          .771
      3         16.823       14.014       11.124          .725
      4          9.613       13.960        9.714          .844
      5          9.621       10.476        7.587          .756
     12         30.466       34.029       27.309          .875
     13         35.389       35.693       28.126          .791
     14         26.709       35.999       25.686          .875
     15         26.770       30.392       23.247          .815
     23         46.710       45.033       38.161          .832
     24         33.432       42.993       33.312          .879
     25         33.722       36.331       30.297          .866
     34         37.258       41.703       33.496          .850
     35         37.080       38.500       30.393          .804
     45         30.121       37.600       29.227          .869


TABLE XXVIIb

DATA OBTAINED FROM COMBINATORIAL RELIABILITY ANALYSIS,
JUNIOR DOMINION GROUP TEST OF INTELLIGENCE

Combination    Variance,    Variance,    Covariance    Reliability
of Variables   1st appl.    2nd appl.
    123         75.532       78.869       67.189          .871
    124         60.783       77.188       61.310          .895
    134         64.746       76.217       60.704          .864
    234         78.927       89.086       74.613          .890
    125         61.127       68.404       57.983          .897
    135         64.622       70.891       57.288          .846
    235         79.032       82.705       70.622          .874
    145         56.192       70.351       55.092          .876
    245         66.004       79.819       66.018          .910
    345         68.402       79.353       64.690          .878
   1234        116.671      135.757      113.848          .905
   1235        116.830      127.253      115.578          .948
   1245        102.332      124.727      103.910          .920
   1345        104.866      124.579      101.792          .891
   2345        122.136      139.922      119.001          .910
  12345        168.856      197.305      168.129          .921


Whenever the scores on sub-tests, or the scores on tests in a battery,
are added together without the use of weights the technique of com- 
binatorial reliability analysis may be employed to determine whether 
the test is functioning efficiently. Obviously if one or two sub-tests 
are more reliable than all sub-tests combined the scores should not be 
combined by simple additive methods. 
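
As a hedged sketch of how such a check might be mechanised, the following Python fragment computes the reliability of every combination of sub-tests and lists those that are more reliable than the whole test. The 2k x 2k matrix layout (first application in the first k rows and columns, second application in the last k), the function names, and the small covariance matrix are all assumptions made for illustration; the figures are invented and are not taken from Table XXIII or Table XXVI.

```python
from itertools import combinations
from math import sqrt

def combination_reliability(cov, k, subtests):
    """Reliability of the unweighted sum of the given 1-based sub-tests."""
    first = [s - 1 for s in subtests]
    second = [s - 1 + k for s in subtests]
    v1 = sum(cov[i][j] for i in first for j in first)
    v2 = sum(cov[i][j] for i in second for j in second)
    c12 = sum(cov[i][j] for i in first for j in second)
    return c12 / sqrt(v1 * v2)

def combinatorial_analysis(cov, k):
    """Reliability of every non-empty combination of the k sub-tests."""
    results = {}
    for size in range(1, k + 1):
        for combo in combinations(range(1, k + 1), size):
            results[combo] = combination_reliability(cov, k, combo)
    return results

def better_than_whole(results, k):
    """Combinations whose reliability exceeds that of all k sub-tests together."""
    whole = results[tuple(range(1, k + 1))]
    return {c: r for c, r in results.items() if r > whole}

# Invented covariance matrix for k = 2 sub-tests applied twice.
cov = [
    [4.0, 0.5, 3.2, 0.4],
    [0.5, 9.0, 0.4, 4.5],
    [3.2, 0.4, 4.0, 0.5],
    [0.4, 4.5, 0.5, 9.0],
]
results = combinatorial_analysis(cov, 2)
flagged = better_than_whole(results, 2)
print(sorted(flagged))
```

In this invented case sub-test 1 alone is flagged, which is the kind of result that would argue against simple addition of the two scores.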

If the test has not been given a second time, consistency coefficients
calculated either by Hoyt’s method of analysis or by the Kuder and
Richardson formula (20) may be used in carrying out the combinatorial
analysis. Under such circumstances coefficients are obtained
which indicate the properties of the various parts of the test to coexist
with one another.

The calculation required for a complete combinatorial analysis
using consistency coefficients is somewhat simpler than the process of
calculation already described in this chapter. If consistency
coefficients are used the elements in the upper left-hand quadrant of the
covariance matrix are identical with the elements in the lower
right-hand quadrant, and, with the exception of the diagonal elements,
identical with the elements in the upper right-hand quadrant.

The error variance of a single test is given by

$$ s_e^2 = s_i^2 (1 - r_{ii}) $$

These error variances for any combination of tests are directly additive.
Hence the consistency coefficient of any battery of tests is given by

$$ r_{tt} = 1 - \frac{\sum_{i=1}^{k} s_i^2 (1 - r_{ii})}{\sum_{i}\sum_{j} r_{ij} s_i s_j} $$

where k is the number of tests in the combination, the double sum in the
denominator runs over all pairs of tests, and the terms with i = j are
the variances $s_i^2$. Thus consistency coefficients for all possible
combinations of tests may be readily calculated.
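
The rule that error variances add may be sketched in Python as follows. This is an illustration only: the function name and the small numerical example are hypothetical, and the formula is taken to be one minus the ratio of the summed error variances, s²(1 - r), to the variance of the combined score.

```python
def combined_consistency(variances, consistencies, correlations):
    """Consistency coefficient of the unweighted sum of several tests.

    variances     -- score variance of each test
    consistencies -- consistency coefficient of each test
    correlations  -- square matrix of intercorrelations (diagonal ignored)
    """
    k = len(variances)
    sds = [v ** 0.5 for v in variances]
    # Error variances s^2 (1 - r) of the separate tests are simply added.
    error = sum(v * (1.0 - r) for v, r in zip(variances, consistencies))
    # Variance of the combined score: variances plus all cross-products.
    total = sum(variances)
    for i in range(k):
        for j in range(k):
            if i != j:
                total += correlations[i][j] * sds[i] * sds[j]
    return 1.0 - error / total

# Two invented tests: variances 4 and 9, consistencies .8 and .7, correlation .5.
example = combined_consistency([4.0, 9.0], [0.8, 0.7], [[1.0, 0.5], [0.5, 1.0]])
print(round(example, 3))
```

Applying the function to every subset of tests gives the complete combinatorial analysis without a second application of the test.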
CHAPTER VII


THE REPORTING OF DATA RELATING TO THE RELIABILITY 
OF A TEST 


Complete and detailed information concerning the reliability of a 
test is a major factor not only in enabling the research worker to 
determine a test’s usefulness in dealing with a particular problem, but 
also to assist him to interpret the results obtained. The acquisition 
of reasonably complete data regarding a test’s reliability requires the 
planning and execution of one or more special experiments, the col- 
lection and analysis of a considerable body of data, and the presenta- 
tion, preferably in the Manual of Directions accompanying the test, 
of the findings of such analysis. The test maker will see, therefore, 
that the assessment of a test’s reliability, and consequent efficiency 
as a measuring instrument, is a matter for specific and detailed 
investigation. 

In the previous chapters we have discussed various problems relat- 
ing to reliability, and their solutions. The main findings will be sum- 
marized here, however, in order that the reader may have the necessary 
details clearly in mind when considering our suggestions regarding the 
reporting of data dealing with the reliability of a test. Not all the 
suggestions refer directly to the problem of reliability. Clearly a 
certain amount of descriptive material must be given, such as the type
of test, material used, purpose of the test, etc., and a discussion of the 
type of problems in which the test may be used. In addition it must 
be clearly stated in which situations the test is to be used, for which 
grade or other unit, and the information relating to reliability must 
be given separately for each separate unit in order that the reader may 
determine the usefulness of the test in any given situation. 

Any report on the reliability of a test must give an estimate of both 
the absolute and the relative accuracy with which the test measures. 
To determine the absolute accuracy we need a knowledge of the mag- 
nitude of the errors of measurement; this can be given most conve- 
niently in the form of what is called the standard error of measurement. 
To determine the relative accuracy we require a knowledge of the 
relative magnitude of these errors of measurement in comparison with 
the magnitude of the differences between the individuals (or groups) 
tested. This can be given in the form of either the usual reliability 
coefficient or the sensitivity coefficient of the test; the latter form is 
preferred by the authors. Other information relating to the
distribution of the scores, etc., should also be given.
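
The two kinds of accuracy may be illustrated by a short Python sketch. It is offered under assumptions rather than as the authors' procedure: the standard error of measurement is taken as the square root of s²(1 - r), in line with the error variance used in the preceding chapter, and the sensitivity is taken as the square root of r/(1 - r), which follows if the reliability equals the share of the score variance due to differences between individuals. The function names are hypothetical; the variance 26.227 and reliability .771 are the figures quoted earlier for the 124 combination.

```python
from math import sqrt

def standard_error_of_measurement(variance, reliability):
    """Absolute accuracy: standard deviation of the errors of measurement."""
    return sqrt(variance * (1.0 - reliability))

def sensitivity_from_reliability(reliability):
    """Relative accuracy: spread of individuals in units of the error spread."""
    return sqrt(reliability / (1.0 - reliability))

sem = standard_error_of_measurement(26.227, 0.771)
gamma = sensitivity_from_reliability(0.771)
print(round(sem, 2), round(gamma, 2))
```

The first figure is expressed in score units and the second is a pure ratio, which is why both are needed in a report.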

In reporting on the reliability of a test a clear distinction must be
drawn between results which refer to the reliability in the usual sense
and the internal consistency of the test. In the latter case the test is
given once and a function of the scores used to give an estimate of the
internal consistency of the test [...] reliability. In the case of
reliability [...] of the scores of the same individuals on the two trials
used as a measure of reliability. It is [...] consistency and reliability
[...] considered in the previous paragraph. In the past the correlation
technique has been used almost exclusively but in many cases, as has
been shown in an earlier chapter, the analysis of variance and
covariance [...] There is no incompatibility [...], and in most cases the
choice of which of the two methods to employ will be determined by the
nature of the problem under consideration. In most of the problems
connected with the reliability and internal consistency of a test, the
authors prefer the analysis of variance and covariance method for the
following reasons:

1. The arithmetical operations involved in the analysis appear to
be simpler and easier to carry out.

[...] and the necessary tests of [...] lead to the drawing of
unwarranted conclusions.

For these reasons, therefore, the authors suggest that the analysis of
variance and covariance method should be used in analysing the results
unless the nature of the problem is such that the use of the correlation
technique is indicated, e.g. in the combinatorial reliability analysis.
The analysis of variance and covariance method is, of course, a general
one and is applicable to any or all of these problems.

There is, finally, the question of the analysis appropriate for special 
kinds of tests. For the omnibus type of test the usual kind of analysis 
can be employed but in battery tests, on the other hand, a series of
analyses should be given. For a battery test we need to know the 
reliability and internal consistency of each sub-test, the relationship 
between the various sub-tests, and, in addition, the reliability and 
internal consistency of all possible combinations of sub-tests. For 
this type of test, therefore, it is suggested that a combinatorial relia- 
bility analysis should be reported and also, wherever practicable, a 
study made of the question of weighting the sub-tests to give maximum 
reliability of the battery. 

Below we have prepared an outline of a report on information which 
in our opinion should be presented in the Manual of Directions 
accompanying a test. If any reader feels that a report of the type 
suggested will involve a large amount of unnecessary work, or if he is
of the opinion that the details to be reported are more of academic 
than of practical interest, he is asked to carry out a simple enquiry. 
Let him select a problem in some educational field involving the use 
of tests of types which have already been constructed, and let him 
then endeavour to determine from the available information which 
tests are applicable in solving his particular problem, and with what 
degree of accuracy he may expect to measure under the conditions 
specified by his problem. In the majority of cases his experience will 
no doubt be similar to ours. Our ordinary work involves the use of 
many tests, and also the furnishing of advice to teachers and others 
regarding the most suitable test or tests to use in the solution of a 
particular problem. Rarely have we been able to answer all relevant 
questions by reference to the published information. Usually we find 
it necessary either to plan and carry out a special experiment of our 
own, or recommend for use a test or tests by a particular author whose 
work we know to be in general satisfactory. We suggest, therefore, 
that the information outlined in the following summary be collected 
and published by the author of each test. The reporting of a single 
reliability coefficient without even a statement of the number of per- 
sons tested or a description of the group tested contributes little or 
nothing to our knowledge. 

No mention has been made in the report given below of reliability 
coefficients calculated by the split-half or the odd-even method. We 
suggest that this method of estimating the reliability or the internal 
consistency of tests be no longer employed. Other methods are avail- 
able which do not involve the arbitrary division of the test into parts, 
and yield substantially more information. 
SUGGESTED TEST REPORT (IN MANUAL OF DIRECTIONS)


A. General Statement

   1. Purpose of test:

   2. Situations in which it is recommended for use:
        e.g. Grade or grades, ages, etc., of pupils.

   3. Time taken to administer:
        Working time for test proper and total time.

   4. Type of test:
        Battery or omnibus,
        Self-administering or otherwise,
        Method of pupil-response.

   5. Number of items:
        By sub-tests and total,
        Method of scoring,
        Possible scores on sub-tests and total.

   6. Test material:
        Description of type of material used in each sub-test.

   7. Norms:
        Type of norms given and how these are to be interpreted,
        Description of population to which norms apply,
        Description of method used in calculating norms.


B. Data Relating to Study of Reliability

   1. Number of pupils tested:
        By age, sex, grade and other units considered.

   2. Distribution of scores:
        Give actual distribution of scores (plus mean and
        standard deviation) for each grade or other unit
        considered, and for each sub-test and the whole test.

   3. Reliability:
        Separately for each grade or sampling unit considered.

        (a) Internal consistency
              Suggest use of method proposed by Hoyt [51].

        (b) Test-retest
              Suggest use of method proposed by Jackson [52].

        (c) Comparable forms
              Suggest use of method proposed by Jackson [52].

        (Note: In addition the usual correlation coefficients may
        be given although they add nothing to the information
        obtained by the use of the above methods.)

   4. Battery tests:
        In the case of a battery of tests or sub-tests, all the
        information under 3 above is to be given for each
        sub-test. In addition, all the intercorrelations of the
        sub-tests and a combinatorial reliability analysis are to be
        reported.

   5. Standard error of an individual score:
        To be reported for each sub-test and total test and for
        each grade or other unit employed.

   6. Intelligence tests:
        Report the above information separately for raw
        scores, mental age scores, and intelligence quotients
        if these different kinds of scores yield different results.

   7. Comparison with other tests:
        Report correlation coefficients and other relevant data.
APPENDIX A
NOTE ON THE ESTIMATION OF RELIABILITY COEFFICIENTS 


In order that the results given in this Bulletin might be directly comparable 
with those given elsewhere, the general formula for the Pearson product- 
moment correlation coefficient has been used in calculating the reliability 
coefficients. As was mentioned earlier, however, the use of this particular 
formula is not always justified in problems such as these. The sample value 
is used as an estimate of the population value, and hence the particular method 
to be used in calculating the value of our estimate will be determined by the 
conditions which are assumed to exist in the population from which we are 
sampling. If the conditions are changed, then obviously we must make an 
appropriate change in the method used in estimating the population values. 
This section is devoted to a discussion of these problems and to the development 
of the methods to be used in the estimation of reliability coefficients under 
different conditions. 

It is necessary, in the first place, to point out that underlying the correlation 
method is the fundamental assumption that the variables which we are cor- 
relating are normally distributed and that the regressions are linear in form. 
It follows that when we use this method we assume, implicitly or explicitly, 
that these conditions are satisfied. Fortunately, in much of the educational 
work of this kind, these conditions are roughly satisfied. Our tests are so 
constructed that, if they are used in the appropriate situations, few individuals 
make very low or very high scores and the general distribution may be ade- 
quately represented by a normal curve. Similarly, the linearity of regression 
condition is generally satisfied. If the test is too easy or too difficult for the
pupils, however, these conditions are not satisfied, but it is assumed that
research workers will be careful to use tests only in the situations for which they
were designed. If the conditions are not satisfied, then it is generally
impossible to interpret our statistical results but this limitation is, of course,
well-known.
In the discussion which follows, we shall assume that the variables are
normally distributed and the regressions are linear. It must not be concluded,
however, that because of this the results deduced here are less valid than the
ones more widely used. They apply, in fact, to all situations in which
correlation methods may be used.

Let us consider the general case in which we have two sets of scores on the 
same test for the same individuals. As far as the general discussion is con- 
cerned, it is immaterial whether they refer to results of the test-retest, alter- 
native forms or split-half experimental method. If we assume that in the 
population from which we are sampling the variables are normally distributed, 
then the form of the distribution will, in general, be determined by the values 
of the two means, the two standard deviations and the correlation coefficient. 
This is the general case to which the usual Pearson product-moment formula 
for estimating the correlation coefficient applies, and it will be considered first.


There are, however, two other cases in which, it is maintained, the conditions
agree more closely with those which exist in the problem of estimating the
reliability coefficients. The second case to be considered is one in which the
two standard deviations may be assumed to be equal: in this case the form of
the distribution of the variables in the population from which we are sampling
will be determined by the values of the two means, the common standard
deviation and the correlation coefficient. These conditions are assumed to
hold in the test-retest and alternative forms experimental methods, and in
analysing these experimental results we must use the appropriate estimates of
the population values. The third case to be considered is the one in which it
may be assumed that the two means are equal, and the two standard deviations
are equal: in this case the form of the distribution of the variables in the
population from which we are sampling will be determined by the values of the
common mean, the common standard deviation and the correlation coefficient.
These conditions are assumed to hold in the split-half experimental method,
and in analysing these experimental results we must use the appropriate
estimates of the population values.

In the following discussion, each of these cases will be considered separately.
The "maximum likelihood" method will be used in each case to determine the
appropriate estimates. This well-known method is based on the principle that
we use as estimates those values which maximize the probability of the
observed event; hence the name, "maximum likelihood" method. To obtain the
estimates we differentiate the probability function with respect to the
parameters in which we are interested, set the resulting equations equal to
zero, and solve. This process gives us what are termed the maximum likelihood
estimates of the parameters or population values.


Case 1. General Case

Denote by $X_i$ and $Y_i$ the scores obtained by the i-th individual on the first
and second trials of the test, respectively; by $M_x$ and $M_y$ the means, by
$\sigma_x$ and $\sigma_y$ the standard deviations, and by $\rho$ the correlation
coefficient in the sampled population of X and Y. The subscripts, x or y, denote
the variables to which these parameters refer. If the variables are normally
distributed, then it is known that the simultaneous probability distribution of
all the N values of $X_i$ and $Y_i$ is

$$
p(X_1,\ldots,X_N, Y_1,\ldots,Y_N) = \frac{1}{\left(2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}\right)^N}
\exp\left[-\frac{1}{2(1-\rho^2)}\sum\left\{\frac{(X_i-M_x)^2}{\sigma_x^2}
- \frac{2\rho(X_i-M_x)(Y_i-M_y)}{\sigma_x\sigma_y}
+ \frac{(Y_i-M_y)^2}{\sigma_y^2}\right\}\right]
$$

where $\sum$ denotes summation for all N values of i.

As we wish to estimate $M_x$, $M_y$, $\sigma_x$, $\sigma_y$ and $\rho$, we shall have
to take the partial derivatives of $p(X_1,\ldots,X_N, Y_1,\ldots,Y_N)$, which we may
denote by p for short, with respect to each of these. It is more convenient,
however, to work with the natural logarithm of p: this will not affect the
results, of course, as p will be a maximum if log p is a maximum. We have

$$
\log p = -N\log 2\pi - N\log\sigma_x - N\log\sigma_y - \frac{N}{2}\log(1-\rho^2)
- \frac{1}{2(1-\rho^2)}\sum\left\{\frac{(X_i-M_x)^2}{\sigma_x^2}
- \frac{2\rho(X_i-M_x)(Y_i-M_y)}{\sigma_x\sigma_y}
+ \frac{(Y_i-M_y)^2}{\sigma_y^2}\right\} \tag{74}
$$

Differentiating log p with respect to $M_x$, and setting the result equal to
zero, we obtain:

$$
\frac{\partial \log p}{\partial M_x} = \frac{1}{2(1-\rho^2)}\left\{\frac{2\sum(X_i-M_x)}{\sigma_x^2}
- \frac{2\rho\sum(Y_i-M_y)}{\sigma_x\sigma_y}\right\} = 0 \tag{75}
$$

which reduces to

$$ \sigma_y\sum(X_i-M_x) = \rho\,\sigma_x\sum(Y_i-M_y) \tag{76} $$

Similarly, differentiating log p with respect to $M_y$, setting the equation equal
to zero and simplifying, we obtain:

$$ \sigma_x\sum(Y_i-M_y) = \rho\,\sigma_y\sum(X_i-M_x) \tag{77} $$

Assuming $\sigma_x \neq 0$, $\sigma_y \neq 0$, $\rho \neq 1$, we may solve (76) and (77) to obtain

$$ M_x = \frac{\sum X_i}{N}, \qquad M_y = \frac{\sum Y_i}{N} \tag{78} $$
Differentiating log p partially with respect to $\sigma_x$, $\sigma_y$, $\rho$ in turn,
setting the resulting equations equal to zero, solving and substituting the
values given in (78) for $M_x$ and $M_y$, we have finally:

$$ \sigma_x^2 = \frac{\sum(X_i-M_x)^2}{N}, \qquad \sigma_y^2 = \frac{\sum(Y_i-M_y)^2}{N} \tag{79} $$

$$ \rho = \frac{\sum(X_i-M_x)(Y_i-M_y)}{\sqrt{\sum(X_i-M_x)^2\,\sum(Y_i-M_y)^2}} \tag{80} $$

The maximum likelihood estimates of the five parameters, $M_x$, $M_y$, $\sigma_x$,
$\sigma_y$, and $\rho$, are given, therefore, in equations (78), (79) and (80). These
require no explanation as they are, of course, the estimates generally used.


Case 2. Equal Standard Deviations


In this case, we assume that the standard deviations of the two distributions
are equal, i.e.

$$ \sigma_x = \sigma_y = \sigma $$

where $\sigma$ denotes the value of the standard deviation common to the two
distributions. The simultaneous probability distribution, p, of all the N values
of $X_i$ and $Y_i$ will be

$$
p = \frac{1}{\left(2\pi\sigma^2\sqrt{1-\rho^2}\right)^N}
\exp\left[-\frac{1}{2\sigma^2(1-\rho^2)}\sum\left\{(X_i-M_x)^2
- 2\rho(X_i-M_x)(Y_i-M_y) + (Y_i-M_y)^2\right\}\right]
$$

Using log p; differentiating with respect to $M_x$, $M_y$, $\sigma$, $\rho$, separately;
setting the resulting equations equal to zero, and solving as in the previous
case, we find:

$$
\begin{aligned}
M_x &= \frac{\sum X_i}{N}, \qquad M_y = \frac{\sum Y_i}{N} \\
\sigma^2 &= \frac{\sum(X_i-M_x)^2 + \sum(Y_i-M_y)^2}{2N} \\
\rho &= \frac{2\sum(X_i-M_x)(Y_i-M_y)}{\sum(X_i-M_x)^2 + \sum(Y_i-M_y)^2}
\end{aligned} \tag{83}
$$

The values given in equation (83) are the maximum likelihood estimates of
the four parameters $M_x$, $M_y$, $\sigma$ and $\rho$. It follows, therefore, that in
problems in which the assumption of a common standard deviation is satisfied,
these are the appropriate estimates of the population values. In analysing the
data obtained by using the test-retest and alternative forms experimental
methods we should, therefore, use the formulae given in (83) in calculating the
estimates of the parameters in which we are interested.


Case 3. Equal Means and Equal Standard Deviations

In this case we assume that the means and the standard deviations of the
two distributions are equal, i.e.

$$ M_x = M_y = M, \qquad \sigma_x = \sigma_y = \sigma $$

where M denotes the value of the mean, and $\sigma$ the value of the standard
deviation, common to the two distributions. Proceeding as before, we obtain
finally:

$$
\begin{aligned}
M &= \frac{\sum X_i + \sum Y_i}{2N} \\
\sigma^2 &= \frac{\sum(X_i-M)^2 + \sum(Y_i-M)^2}{2N} \\
\rho &= \frac{2\sum(X_i-M)(Y_i-M)}{\sum(X_i-M)^2 + \sum(Y_i-M)^2}
\end{aligned} \tag{85}
$$

The values given in equation (85) are the maximum likelihood estimates of the
three parameters M, $\sigma$, $\rho$. In problems in which we may assume a common
mean and standard deviation for the two distributions, we should, therefore,
use these formulae in calculating the estimates of the parameters. It follows,
therefore, that in analysing the data obtained by the use of the split-half
experimental method, we should use the formulae given in equation (85) in
calculating the estimates of the population values.


Examples

(1) The values given below refer to the scores made by 29 pupils on
two forms of an intelligence test. To save space, only the necessary
totals are given:

    $N = 29$
    $\sum X_i = 633$
    $\sum Y_i = 757$
    $\sum X_i^2 = 16537$
    $\sum Y_i^2 = 23685$
    $\sum X_i Y_i = 19269$

To test whether or not the standard deviations may be assumed to be
equal, we calculate

$$ F = \frac{\sum Y_i^2 - (\sum Y_i)^2/N}{\sum X_i^2 - (\sum X_i)^2/N} = 1.44 $$

and refer to Snedecor's tables of F with the appropriate degrees of freedom.
We find that F is less than the 5% point given in the tables, and
conclude that the assumption of a common standard deviation is
justified.

Using the formulae given in equation (83), we find r = 0.826 as the
estimate of the reliability coefficient. If we had used the
formula given in equation (80), the formula usually used, we should
have had r = 0.840 as our estimate of the reliability coefficient. The
difference between the estimates is not large, and in some cases is even
smaller, but in cases of this kind we must, to be consistent, use the
best available estimate of the reliability coefficient.
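
The arithmetic of example (1) may be retraced with a short Python sketch. It is an illustration under a stated assumption: the estimate for a common standard deviation is taken to be twice the sum of products of deviations divided by the sum of the two sums of squares of deviations, each taken about its own mean, which is the maximum likelihood solution when the two standard deviations are equal. The variable names are hypothetical; the totals are those quoted in the example.

```python
from math import sqrt

N = 29
sum_x, sum_y = 633, 757
sum_xx, sum_yy, sum_xy = 16537, 23685, 19269

# Sums of squares and products of deviations from the separate means.
sxx = sum_xx - sum_x ** 2 / N
syy = sum_yy - sum_y ** 2 / N
sxy = sum_xy - sum_x * sum_y / N

variance_ratio = syy / sxx              # F: larger sum of squares over smaller
r_general = sxy / sqrt(sxx * syy)       # usual product-moment formula
r_equal_sd = 2 * sxy / (sxx + syy)      # assumes a common standard deviation

print(round(variance_ratio, 2), round(r_general, 3), round(r_equal_sd, 3))
```

The ratio comes to about 1.44 and the usual formula to about .840; the common standard deviation estimate is somewhat lower, at about .826, because the arithmetic mean of the two sums of squares in its denominator is never less than their geometric mean.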
(2) The values given below refer to the scores made by 29 pupils on
the odd and even items of a test. The totals are:

    $N = 29$
    $\sum X_i = 327$
    $\sum Y_i = 316$
    $\sum X_i^2 = 3731$
    $\sum Y_i^2 = 3478$
    $\sum X_i Y_i = 3591$

We find, making the appropriate tests, that neither the means nor the
standard deviations of the two distributions are significantly different;
we may, therefore, assume a common mean and a common standard
deviation.

Using the formula given in equation (85), we find r = 0.665 as the estimate
of the reliability coefficient of half the test. If we use the formula given in
equation (80), we find r = 0.714 as the estimate of the reliability coefficient of
half the test. The difference here is considerable, and clearly in cases such as
these we must be careful to use the appropriate formula in calculating our
estimates.
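
The two estimates in example (2) may be checked with the following Python sketch. It assumes that the estimate for a common mean and common standard deviation uses deviations from the pooled mean of all 2N scores, the correlation being twice the sum of products divided by the sum of the two sums of squares. The variable names are hypothetical; the totals are those quoted in the example.

```python
from math import sqrt

N = 29
sum_x, sum_y = 327, 316
sum_xx, sum_yy, sum_xy = 3731, 3478, 3591

# Usual product-moment estimate: deviations from the separate means.
sxx = sum_xx - sum_x ** 2 / N
syy = sum_yy - sum_y ** 2 / N
sxy = sum_xy - sum_x * sum_y / N
r_general = sxy / sqrt(sxx * syy)

# Common mean and common standard deviation: deviations from the pooled mean.
m = (sum_x + sum_y) / (2 * N)
dxx = sum_xx - 2 * m * sum_x + N * m * m
dyy = sum_yy - 2 * m * sum_y + N * m * m
dxy = sum_xy - m * (sum_x + sum_y) + N * m * m
r_common = 2 * dxy / (dxx + dyy)

print(round(r_general, 3), round(r_common, 3))
```

The sketch reproduces the two values quoted in the text, .714 and .665, from the totals alone.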


It should be pointed out that in all cases we must test whether or not our
assumptions of equal standard deviations, or means, are justified. Sometimes
they are not, and in these cases it is necessary to find the reason. In using the
split-half method, for example, we found an interesting case in which the mean
score on the odd items was not equal to the mean score on the even items.
When we calculated the difficulties of the items, we found that they were not
arranged in order of difficulty; many of the odd items were considerably harder
than the corresponding even items. Obviously, for such a test one cannot use
the split-half method in determining the reliability.
APPENDIX B


NOTE ON TESTS OF CERTAIN HYPOTHESES RELATING TO THE 
PROBLEM OF MEASURING THE SENSITIVITY OF A 
MENTAL TEST 


In a previous paper, Jackson [52] has considered the problem of measuring
the reliability or consistency of mental tests and suggested a measure, called
the sensitivity of the test, based on the concept of the relative accuracy of the
measurements. The present section is concerned with the development of
tests of certain hypotheses closely related to this problem.

Underlying the solution of the above problem is the assumption that the
score of an individual on the s-th trial of a test may be considered as the sum
of certain factors or components, i.e.

$$ Y_{st} = A + B_s + C_t + z_{st} \tag{86} $$

where s = 1, 2; t = 1, 2, ..., n; n represents the number of pupils tested; and
$Y_{st}$ is the score of the t-th individual on the s-th trial of the test. A is
considered as a measure of the common ability of the group. It is assumed to be
constant, and is defined as the arithmetic mean of the "true" effects for all
trials and individuals. $B_s$ is considered as a measure of the trial effect, and
is also assumed to be a constant. $C_t$ is considered as a measure of the
individual effect, i.e. a measure of some capacity or ability of the t-th
individual. $z_{st}$ is a measure of the residual or error effect, i.e. a measure
of the errors of measuring by means of the test. It is assumed, further, that in
the population from which we are sampling, $C_t$ is normally distributed about
zero with constant standard deviation, $\sigma_c$, constant for all trials, and
$z_{st}$ is normally distributed about zero with constant standard deviation,
$\sigma$, constant for all trials and individuals.

The sensitivity of a test, denoted by $\gamma$, is defined as the ratio of these two
standard deviations, i.e.

$$ \gamma = \frac{\sigma_c}{\sigma} \tag{87} $$

Since $\gamma$ expresses the differences between the individuals tested in terms of
the errors of measurement of the test, it may be called a measure of the relative
accuracy of the test. For this reason, the interpretation of the sensitivity is
particularly simple and easy to understand.

The problems discussed in the above paper are as follows:

(1) to determine if there is a significant trial effect;

(2) to determine if the mental test actually measures the capacity of the
individuals tested;

(3) to estimate the trial effect if it exists;

(4) to estimate the sensitivity of the mental test if it is found that the test
measures the capacity of the individuals tested.

The statistical hypotheses to be tested in the solution of problems (1) and
(2) are

$$ H_1: B_s = 0, \qquad H_2: C_t = 0 \ (\text{or } \gamma = 0) \tag{88} $$
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The procedure to be followed in testing these hypotheses, together with the 
solution for problems (3) and (4), has been given elsewhere in this Bulletin 
(Chapter III) and will not be repeated here. In using this method, however, 
it has been found that there is another problem, similar to (2) in many respects, 
which it is convenient to consider before we proceed to the estimation of the 
sensitivity of the test. It arises only when we reject H, : C,=0 (or y =0) and 
therefore conclude that the test does measure the capacity of the individuals. 
It may happen that we find y>0, i.e. we reject Hz, but that the test is not sen- 
sitive enough to give results which could be considered as satisfactory. If 
Y =1, for example, then о: =ç and we would conclude that the error effect might 
be as important as the individual effect (i.e, the capacity of the individual) in 
determining the actual score obtained by the individual. We should, in fact, 
make an error as great or greater than ос units in about 32% of cases in using 
the scores as estimates of the true capacity of the individuals. The problem of 
selecting a lower limit for the sensitivity of tests, say ү =2, for example, in order 
that tests which do not reach this standard may be automatically eliminated as 
unsatisfactory, is a psychological rather than a statistical problem. It is clear 
that the particular value chosen as the lower limit of sensitivity will be deter- 
mined by the conditions of the experiment and the use which is to be made of 
the results. A test may give satisfactory results in one situation but compara- 
tively useless results in another. For these reasons the particular numerical 
value of y to be used has not been specified in the theoretical part of this paper, 
but the general case y=K, where K is some constant greater than zero, has 
been considered. 
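
The 32% figure quoted above can be checked numerically. The following is a minimal sketch, assuming (as the model does) that the error of measurement is normally distributed about zero with standard deviation $\sigma$, so that an error of $\sigma_c = \gamma\sigma$ units or more occurs with the probability that a standard normal deviate exceeds $\gamma$ in absolute value. The function name is illustrative and is not taken from the original paper.

```python
import math

def prob_error_at_least_sigma_c(gamma: float) -> float:
    """P(|error| >= sigma_c) when error ~ N(0, sigma^2) and sigma_c = gamma * sigma."""
    # For a standard normal Z, P(|Z| >= g) = erfc(g / sqrt(2)).
    return math.erfc(gamma / math.sqrt(2.0))

for g in (1.0, 2.0):
    print(f"gamma = {g}: P = {prob_error_at_least_sigma_c(g):.4f}")
```

For $\gamma = 1$ the probability is close to 0.32, and for $\gamma = 2$ it falls below 0.05, which is the sense in which a larger sensitivity makes the scores more trustworthy as estimates of capacity.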

The statistical hypothesis to be tested may be stated in the form

$$H_3: \gamma = K \tag{89}$$

where $K$ is some specified constant, always greater than zero. We should, of
course, always test the hypothesis $H_2: \gamma = 0$ first; if we accept $H_2$ there will be
no need to carry the analysis further. The purpose of this section, therefore,
is to consider the problem,

(5) to determine if the mental test is sensitive enough to yield
satisfactory results.

We shall follow the method outlined in Chapter III and use the theory of testing
statistical hypotheses devised and developed by Neyman and Pearson [75, 76].

The set of hypotheses alternative to $H_3$ will include all hypotheses specifying
values of $\gamma$ greater than zero but not equal to $K$. If we reject the hypothesis
$H_3: \gamma = K$, however, we shall generally wish to know whether the true value of $\gamma$
is greater than or less than $K$. If $\gamma$ appears to be less than $K$, we would conclude
that the test is not sensitive enough to yield satisfactory results; if, on the other
hand, $\gamma$ appears to be greater than $K$ we would conclude that the test is sensitive
enough (more sensitive than is necessary, in fact) to yield satisfactory results,
although in both cases we would reject the hypothesis $H_3: \gamma = K$. It seems
necessary, therefore, to distinguish between the two sets of hypotheses
alternative to $H_3$, namely:

(Case 1) the set of alternative hypotheses specifying values of $\gamma$ less than
$K$, i.e. $H_3': \gamma = K'$ where $K'$ is some constant greater than zero but
less than $K$.

(Case 2) the set of alternative hypotheses specifying values of $\gamma$ greater than
$K$, i.e. $H_3'': \gamma = K''$ where $K''$ is some constant greater than $K$.

It will be necessary, also, to develop two separate tests of the hypothesis
$H_3: \gamma = K$, one for each set of alternative hypotheses.

We shall consider the case of two trials ($s = 1, 2$ only) as the data are generally
collected in this form. Since $A$ was considered as a constant and defined as
the arithmetic mean of the effects for all trials and individuals, it follows that

$$\sum_s B_s = 0, \quad \text{or} \quad B_2 = -B_1 = B, \text{ say} \tag{90}$$

where $B$ is now the measure of the trial effect. We may rewrite (86) in the form

$$\begin{aligned} Y_{1t} &= A - B + C_t + z_{1t} \\ Y_{2t} &= A + B + C_t + z_{2t} \end{aligned} \tag{91}$$


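From the above assumptions it is easily seen that

(a) $Y_{1t}$ is normally distributed about a mean $A - B$ with constant standard
deviation equal to $\sqrt{\sigma_c^2 + \sigma^2}$ and

(b) $Y_{2t}$ is normally distributed about a mean $A + B$ with the same constant
standard deviation.

Denoting by $p(Y_{st})$ the elementary probability distribution of $Y_{st}$, we have

$$p(Y_{1t}) = \frac{1}{\sqrt{2\pi}\,\sqrt{\sigma_c^2+\sigma^2}}\; e^{-\frac{(Y_{1t}-A+B)^2}{2(\sigma_c^2+\sigma^2)}} \tag{92}$$

$$p(Y_{2t}) = \frac{1}{\sqrt{2\pi}\,\sqrt{\sigma_c^2+\sigma^2}}\; e^{-\frac{(Y_{2t}-A-B)^2}{2(\sigma_c^2+\sigma^2)}} \tag{93}$$

It follows that the simultaneous probability distribution of all the $Y$'s is

$$p(Y_{11},\ldots,Y_{2n}) = \left(\frac{1}{2\pi(\sigma_c^2+\sigma^2)\sqrt{1-\rho^2}}\right)^{n} \exp\left\{-\frac{1}{2(1-\rho^2)}\sum_{t}\left[\frac{(Y_{1t}-A+B)^2}{\sigma_c^2+\sigma^2} - 2\rho\,\frac{(Y_{1t}-A+B)(Y_{2t}-A-B)}{\sigma_c^2+\sigma^2} + \frac{(Y_{2t}-A-B)^2}{\sigma_c^2+\sigma^2}\right]\right\} \tag{94}$$

where $\rho$ is the correlation coefficient in the sampled population of $Y$'s.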

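Define $\theta$ by

$$\theta = 1 + 2\gamma^2 \tag{95}$$

and consider the simultaneous distribution of the $2n$ other variables defined by
the equations:

$$u_t = Y_{1t} + Y_{2t}, \qquad v_t = Y_{2t} - Y_{1t} \qquad (t = 1, 2, \ldots, n) \tag{96}$$

Making the appropriate transformation, and substituting

$$\rho = \frac{\gamma^2}{1+\gamma^2}, \qquad \sigma_c^2 = \gamma^2\sigma^2 \tag{97}$$

we find that $u_t$ and $v_t$ are distributed independently and

$$p(u_1,\ldots,u_n) = \left(\frac{1}{2\sigma\sqrt{\pi\theta}}\right)^{n} \exp\left\{-\frac{1}{4\sigma^2\theta}\left[\sum_t (u_t-\bar u)^2 + n(\bar u - 2A)^2\right]\right\} \tag{98}$$

$$p(v_1,\ldots,v_n) = \left(\frac{1}{2\sigma\sqrt{\pi}}\right)^{n} \exp\left\{-\frac{1}{4\sigma^2}\left[\sum_t (v_t-\bar v)^2 + n(\bar v - 2B)^2\right]\right\} \tag{99}$$

where

$$\bar u = \frac{1}{n}\sum_t u_t, \qquad \bar v = \frac{1}{n}\sum_t v_t \tag{100}$$
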
Define, also,

$$nS_1^2 = \sum_t (u_t - \bar u)^2, \qquad nS_2^2 = \sum_t (v_t - \bar v)^2 \tag{101}$$

It may be shown that if the value of $\theta$ is specified, then the statistics

$$T_1 = \bar u, \qquad T_2 = \bar v, \qquad T_3 = \frac{S_1^2}{\theta} + S_2^2 \tag{102}$$

form a sufficient set for the parameters $A$, $B$ and $\sigma$. The hypothesis to be tested,
$H_3$, specifies the value $\gamma_0 = K$ of $\gamma$, or the value $\theta_0 = 1 + 2K^2$ of $\theta$. The two sets of
alternative hypotheses, $H_3'$ and $H_3''$, specify the values of $\theta$ less than $\theta_0$ and
greater than $\theta_0$, respectively.
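
The role of $\theta$ can be illustrated by simulation. The sketch below is not part of the original analysis: it assumes the two-trial model (91) with normal effects, takes the sums and differences of the two trial scores as the transformed variables (an assumption about the exact form of the transformation; any common rescaling leaves the variance ratio unchanged), and checks that $\frac{1}{2}\log_e (S_1^2/S_2^2)$ centres on $\frac{1}{2}\log_e \theta$. All numerical settings (`A`, `B`, `sigma`, `reps`, `seed`) are arbitrary illustrative choices.

```python
import math
import random

def simulate_z3(n, gamma, sigma=1.0, A=50.0, B=1.0, rng=random):
    """Draw one sample from the two-trial model and return 0.5*log(S_1^2/S_2^2)."""
    u, v = [], []
    for _ in range(n):
        c = rng.gauss(0.0, gamma * sigma)        # individual effect C_t
        y1 = A - B + c + rng.gauss(0.0, sigma)   # score on the first trial
        y2 = A + B + c + rng.gauss(0.0, sigma)   # score on the second trial
        u.append(y1 + y2)   # sum: carries 2*C_t plus error
        v.append(y2 - y1)   # difference: C_t cancels, only error remains
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s1_sq = sum((x - u_bar) ** 2 for x in u) / n
    s2_sq = sum((x - v_bar) ** 2 for x in v) / n
    return 0.5 * math.log(s1_sq / s2_sq)

def mean_z3(n=35, gamma=2.0, reps=2000, seed=1):
    """Average the statistic over many simulated samples."""
    rng = random.Random(seed)
    return sum(simulate_z3(n, gamma, rng=rng) for _ in range(reps)) / reps

if __name__ == "__main__":
    print(mean_z3(), 0.5 * math.log(1.0 + 2.0 * 2.0 ** 2))
```

With $\gamma = 2$ the two printed values should be close to one another, since the ratio $S_1^2/S_2^2$ estimates $\theta = 1 + 2\gamma^2$.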


Case 1. Test of the Hypothesis $H_3: \gamma = K$ for the Set of Alternative Hypotheses
$H_3': \gamma < K$

The hypothesis $H_3$ does not specify the values of three parameters, $A$, $B$,
and $\sigma$, and is, therefore, a composite hypothesis. The critical region for testing
$H_3$ must be "similar" [74] to the sample space with respect to $A$, $B$, and $\sigma$, that
is to say, such that the probability of its containing the sample point $E$ be equal to
some chosen value $\varepsilon$, independent of the values that those parameters may
possess. The general method of constructing such regions [77], if a set of
sufficient statistics exists, consists of the following.

Denote by $W$ the whole sample space and by the single letter $T$ the whole
set of statistics sufficient for $A$, $B$, and $\sigma$. Further, let $W(T)$ be the locus of
points where $T$ is constant and let $w(T)$ denote a part of $W(T)$. Give, in turn,
all possible values to the statistics forming the sufficient set $T$ and combine the
regions $w(T)$ into one region $w$. If the region $w(T)$ is chosen with the restriction
that the probability that $E$ fall in $w(T)$, calculated on the assumption that it
fall in $W(T)$, be equal to $\varepsilon$, so that $P\{E \in w(T) \mid E \in W(T)\} = \varepsilon$, then the region $w$
will be similar to the sample space with respect to $A$, $B$, and $\sigma$.
If it is desired that this region $w$ be most powerful with respect to some
particular alternative hypothesis $H_3'$ it is necessary and sufficient that the
regions $w(T)$ be chosen on each $W(T)$ to satisfy the inequality:

$$p(Y_{11},\ldots,Y_{2n} \mid H_3') \ge k(T_1,T_2,T_3)\; p(Y_{11},\ldots,Y_{2n} \mid H_3) \tag{103}$$

where $k(T_1, T_2, T_3)$ is a function of $T_1$, $T_2$, and $T_3$ only.
It is easy to see that on each surface $W(T)$ the inequality (103) reduces to

$$S_1^2 \le k_1(T_1,T_2,T_3) \tag{104}$$

where $k_1(T_1, T_2, T_3)$ is again a function of $T_1$, $T_2$, and $T_3$ only and is to be
determined to satisfy the condition $P\{E \in w(T) \mid E \in W(T)\} = \varepsilon$. Since the assumption
that $H_3$ is true and $T_3$ has a fixed value implies that $S_1^2$ cannot exceed $\theta_0 T_3$, this
last condition reduces to

$$\int_0^{k_1} p(S_1^2, T_1, T_2, T_3)\,dS_1^2 = \varepsilon \int_0^{\theta_0 T_3} p(S_1^2, T_1, T_2, T_3)\,dS_1^2 \tag{105}$$

where $\theta_0 = 1 + 2K^2$ and $\varepsilon$ denotes the probability of errors of the first kind which
we fix beforehand.
The probability distribution $p(S_1^2, T_1, T_2, T_3)$ may easily be obtained from
the distribution given in equations (98) and (99). Substituting the value of
$p(S_1^2, T_1, T_2, T_3)$ in equation (105) and simplifying, we find

$$\int_0^{k_1} (S_1^2)^{\frac{n-3}{2}} \left(T_3 - \frac{S_1^2}{\theta_0}\right)^{\frac{n-3}{2}} dS_1^2 = \varepsilon \int_0^{\theta_0 T_3} (S_1^2)^{\frac{n-3}{2}} \left(T_3 - \frac{S_1^2}{\theta_0}\right)^{\frac{n-3}{2}} dS_1^2 \tag{106}$$

Let

$$S_1^2 = \theta_0 T_3\, x, \quad \text{then} \quad dS_1^2 = \theta_0 T_3\, dx; \tag{107}$$

substituting in (106) and simplifying, we obtain

$$\frac{\Gamma(n-1)}{\left[\Gamma\!\left(\frac{n-1}{2}\right)\right]^2} \int_0^{k_1/(\theta_0 T_3)} x^{\frac{n-3}{2}} (1-x)^{\frac{n-3}{2}}\, dx = \varepsilon \tag{108}$$

We can find $k_1/(\theta_0 T_3) = k_0$, say, from the Tables of the Incomplete Beta-Function
[80] for any given value of $\varepsilon$. We reject the hypothesis to be tested when

$$S_1^2 \le k_0\,\theta_0 T_3 \tag{109}$$

But $T_3 = S_1^2/\theta_0 + S_2^2$, and substituting this value in equation (109) we find that we
reject the hypothesis to be tested when

$$\frac{S_1^2}{S_2^2} \le \theta_0\,\frac{k_0}{1-k_0} \tag{110}$$

Or, alternatively, we may use an adaptation of the z-test of R. A. Fisher; i.e.
calculate the value

$$z_3 = \tfrac{1}{2}\log_e \frac{S_1^2}{S_2^2}, \qquad f_1 = f_2 = n-1 \tag{111}$$

and reject the hypothesis to be tested, $H_3: \gamma = K$ (or $\theta = \theta_0$), when $z_3$ is less than
either

$$z_3(5\%) = \tfrac{1}{2}\log_e(1+2K^2) - z_{5\%}, \qquad z_3(1\%) = \tfrac{1}{2}\log_e(1+2K^2) - z_{1\%} \tag{112}$$

where $z_{5\%}$ and $z_{1\%}$ are the 5% and 1% points, respectively, given in Fisher's
tables of the distribution of $z$ [32].

As this test does not depend on the alternative value of $\theta$, it is uniformly most
powerful with respect to all the alternatives of the class considered.


Case 2. Test of the Hypothesis $H_3: \gamma = K$ for the Set of Alternative Hypotheses
$H_3'': \gamma > K$

Proceeding as in Case 1, we find that a common best critical region exists and
is defined by

$$p(Y_{11},\ldots,Y_{2n} \mid H_3'') \ge Q(T_1,T_2,T_3)\; p(Y_{11},\ldots,Y_{2n} \mid H_3) \tag{113}$$

where $H_3''$ is a composite hypothesis alternative to $H_3$ and specifying some value
of $\theta$ greater than $\theta_0$.

On the surface for which the sufficient set of statistics of equation (102) is
constant, (113) reduces to

$$S_1^2 \ge Q_1(T_1,T_2,T_3) \tag{114}$$

where $Q_1(T_1, T_2, T_3)$, or $Q_1$ for short, is chosen so as to satisfy

$$\int_{Q_1}^{\theta_0 T_3} p(S_1^2, T_1, T_2, T_3)\,dS_1^2 = \varepsilon \int_0^{\theta_0 T_3} p(S_1^2, T_1, T_2, T_3)\,dS_1^2 \tag{115}$$

where $\theta_0 = 1 + 2K^2$ and $\varepsilon$ denotes the probability of errors of the first kind which
we fix beforehand.


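Substituting the value of $p(S_1^2, T_1, T_2, T_3)$ of the previous section in equation
(115) and simplifying, we have

$$\int_{Q_1}^{\theta_0 T_3} (S_1^2)^{\frac{n-3}{2}} \left(T_3 - \frac{S_1^2}{\theta_0}\right)^{\frac{n-3}{2}} dS_1^2 = \varepsilon \int_0^{\theta_0 T_3} (S_1^2)^{\frac{n-3}{2}} \left(T_3 - \frac{S_1^2}{\theta_0}\right)^{\frac{n-3}{2}} dS_1^2 \tag{116}$$

Making the transformation given in equation (107), we find that equation (116)
reduces to

$$\frac{\Gamma(n-1)}{\left[\Gamma\!\left(\frac{n-1}{2}\right)\right]^2} \int_{Q_1/(\theta_0 T_3)}^{1} x^{\frac{n-3}{2}} (1-x)^{\frac{n-3}{2}}\, dx = \varepsilon \tag{117}$$

We can find $Q_1/(\theta_0 T_3) = h$, say, from the Tables of the Incomplete Beta-Function
for any value of $\varepsilon$. We reject the hypothesis to be tested when

$$S_1^2 \ge h\,\theta_0 T_3 \tag{118}$$

But $T_3 = S_1^2/\theta_0 + S_2^2$, and substituting this value in (118) we find that we reject
the hypothesis to be tested, $H_3: \gamma = K$, when

$$\frac{S_1^2}{S_2^2} \ge \theta_0\,\frac{h}{1-h} \tag{119}$$

Or, alternatively, we may use an adaptation of the z-test of R. A. Fisher;
i.e. calculate the value

$$z_3 = \tfrac{1}{2}\log_e \frac{S_1^2}{S_2^2}, \qquad f_1 = f_2 = n-1 \tag{120}$$

and reject the hypothesis to be tested, $H_3: \gamma = K$, when $z_3$ is greater than either

$$z_3(5\%) = \tfrac{1}{2}\log_e(1+2K^2) + z_{5\%}, \qquad z_3(1\%) = \tfrac{1}{2}\log_e(1+2K^2) + z_{1\%} \tag{121}$$

where $z_{5\%}$ and $z_{1\%}$ are the 5% and 1% points, respectively, given in Fisher's
tables of the distribution of $z$.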


Examples

The following results refer to a test in French Reading for Grade X,
prepared by our Department and given to two classes of pupils.


TABLE XXVIII

RESULTS OF ANALYSIS OF SCORES ON FRENCH READING TEST

                                          1st Class     2nd Class
n                                                35            39
$nS_1^2$                                  5288.2000     3704.8718
$nS_2^2$                                   166.0857      553.8462
Estimate of $\gamma$                           3.93          1.69
$z_3 = \frac{1}{2}\log_e (S_1^2/S_2^2)$       1.730         0.950

From Fisher's tables of $z$, we find

(a) for Class 1
$f_1 = f_2 = 34$
$z_{5\%} = 0.281$    $z_{1\%} = 0.401$

(b) for Class 2
$f_1 = f_2 = 38$
$z_{5\%} = 0.264$    $z_{1\%} = 0.376$

Let us assume that we have decided the test will yield satisfactory results if
its sensitivity is not less than the arbitrary standard of $\gamma_0 = 2$; we know that if
the test is as sensitive as this, then in using the scores as estimates of the true
ability of the individuals (in French Reading) we shall make an error as great
as or greater than $\sigma_c$ units by chance alone in less than 5% of cases. The
hypothesis we wish to test, therefore, is $H_3: \gamma = 2$ (i.e. $K = 2$).

Consider, first, the test of the hypothesis $H_2: C_t = 0$, i.e. $\gamma = 0$. It will be
seen from the values of $z_3$ given in Table XXVIII and the values of the 5% and
1% points given above, that we would reject this hypothesis in both cases. We
conclude, therefore, that the test does measure the ability of the individuals,
and we may proceed to the test of the hypothesis $H_3: \gamma_0 = 2$ (i.e. $K = 2$).

It will be seen from the values given as estimates of $\gamma$ in Table XXVIII,
that for Class 1 we must consider the set of alternative hypotheses
specifying values of $\gamma$ greater than 2, and for Class 2 we must consider the set of
alternative hypotheses specifying values of $\gamma$ less than 2. The tests for the
two classes are considered separately in the following analysis.


(a) Class 1. Test of the Hypothesis $H_3: \gamma_0 = 2$

Since the set of hypotheses alternative to $H_3$ which we consider specify
values of $\gamma$ greater than $\gamma_0 = 2$, we use the results given above under Case 2. We
find

$z_3(5\%) = \frac{1}{2}\log_e (1 + 2K^2) + z_{5\%} = 1.099 + 0.281 = 1.380$

$z_3(1\%) = \frac{1}{2}\log_e (1 + 2K^2) + z_{1\%} = 1.099 + 0.401 = 1.500$

We find that the observed value of $z_3 = 1.730$ is greater than the 1% point so we
reject the hypothesis tested, $H_3: \gamma_0 = 2$. We conclude that the test gives results
which are satisfactory since it proves to have a sensitivity which is significantly
higher than our selected standard.


(b) Class 2. Test of the Hypothesis $H_3: \gamma_0 = 2$

Since the set of hypotheses alternative to $H_3$ which we consider specify
values of $\gamma$ less than $\gamma_0 = 2$, we use the results given above under Case 1. We
find

$z_3(5\%) = \frac{1}{2}\log_e (1 + 2K^2) - z_{5\%} = 1.099 - 0.264 = 0.835$

$z_3(1\%) = \frac{1}{2}\log_e (1 + 2K^2) - z_{1\%} = 1.099 - 0.376 = 0.723$

We see that the observed value of $z_3 = 0.950$ is greater than the 5% point so we
accept the hypothesis tested, $H_3: \gamma_0 = 2$.

Our conclusion, therefore, should be that there is no evidence against the
hypothesis tested, $H_3: \gamma_0 = 2$, and this may be considered as a justification for
applying the test. However, there is a considerable difference between the
situation concerning the classes 1 and 2.
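
The two decisions can be retraced mechanically. The following sketch takes the observed $z_3$ values and the tabulated 5% and 1% points exactly as quoted in the text (they are not recomputed from any distribution function) and applies the Case 1 or Case 2 rule as appropriate. The function name `assess_h3` and the verdict strings are illustrative.

```python
import math

K = 2.0                                               # standard chosen in the text
HALF_LOG_THETA0 = 0.5 * math.log(1.0 + 2.0 * K ** 2)  # (1/2) log_e 9, about 1.099

def assess_h3(z3, z_5pct, z_1pct, case):
    """Compare an observed z_3 with the 5% and 1% critical values.

    case 1: alternatives gamma < K, reject for small z_3.
    case 2: alternatives gamma > K, reject for large z_3.
    Returns (critical value at 5%, critical value at 1%, verdict string).
    """
    sign = -1.0 if case == 1 else 1.0
    crit5 = HALF_LOG_THETA0 + sign * z_5pct
    crit1 = HALF_LOG_THETA0 + sign * z_1pct
    beyond = (lambda c: z3 < c) if case == 1 else (lambda c: z3 > c)
    if beyond(crit1):
        verdict = "reject at 1%"
    elif beyond(crit5):
        verdict = "reject at 5%"
    else:
        verdict = "accept"
    return crit5, crit1, verdict

class1 = assess_h3(1.730, 0.281, 0.401, case=2)   # Class 1, f1 = f2 = 34
class2 = assess_h3(0.950, 0.264, 0.376, case=1)   # Class 2, f1 = f2 = 38
print(class1)
print(class2)
```

Class 1 falls beyond the 1% critical value and Class 2 falls short of the 5% critical value, in agreement with the conclusions drawn above.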

It should be noted that the tests of $H_3: \gamma = K$ are not very powerful in the
case of small samples. The distributions on which our tests of significance are
based overlap to a marked extent for samples in which $f_1$ and $f_2$ are less than 100.
It follows that the probability of errors of the second kind, i.e. the probability
of our accepting the hypothesis tested when it is, in fact, false, will be very large
for small samples, if we consider alternative hypotheses which specify values
of $\gamma$ not very different from the value, $\gamma_0 = K$, specified by the hypothesis to
be tested. The tests given above, however, are the best possible in the sense
that the probability of errors of the first kind is controlled at a fixed level and
the probability of errors of the second kind is reduced to as low a level as
possible for the type of hypotheses in which we are interested.

All this is relevant from the point of view of interpreting the above numerical
results. In both cases of class 1 and class 2 the conclusion is favourable to the
test, but because of chance variation it may, in fact, be wrong. If it is wrong
for class 1, it would mean that we committed an error of the first kind. But
the probability of this, in our case, does not exceed 1 per cent. On the other
hand, if the presumption that the test is satisfactory and $\gamma_0 = 2$ is wrong for
class 2, then it would mean that in our statistical analysis we committed an
error of the second kind. Because of what has been said above about the power
of the test, this last circumstance is not so unlikely.


APPENDIX C


NOTE ON THE RELATIONSHIP BETWEEN RELIABILITY 
COEFFICIENTS CALCULATED FROM MENTAL 
AGE AND I.Q. SCORES 


When we speak of the reliability of a test, say for a particular school grade,
we imply that there is one and only one reliability coefficient for a particular 
test. Ignoring the effect of the variability in the population sampled, there 
are at least two and in some cases three reliability coefficients which we may 
calculate and use. Our estimate of the reliability may refer to (1) the raw 
scores, (2) the mental age scores or (3) the I.Q. scores. Since for most tests 
there is a simple linear relationship between the raw scores and the mental age 
equivalents, the first two estimates will generally be the same. This is not 
always the case, however, as for some tests there is not such a simple relationship
between raw scores and mental age scores. This difference is not as important 
as the difference between the second and third estimates, and hence the present 
note is concerned only with the relationship between the latter two. 

In a recent paper [54], Jackson has discussed the general relationship between
mental age and I.Q. scores so there is no need to repeat the arguments here. It
was shown that the correlation between I.Q. scores and chronological age will
generally be negative, and that the correlation between two sets of mental age
scores will not be equal, except under certain special conditions, to the correlation
between the corresponding I.Q. scores unless the influence of chronological age
is removed (by the usual partial correlation technique) from both coefficients.

Since a reliability coefficient is only a special form of a correlation coefficient, 
these results may be applied directly to the present problem. When the popu- 
lation from which we are sampling is a school grade, the correlation between the 
I.Q. scores and chronological age will generally be negative, and very often
large. For mental age scores and chronological age, on the other hand, the 
correlation will generally be positive but low, and in many cases will not be 
significantly different from zero. It follows, therefore, that the reliability co- 
efficients calculated from mental age and I.Q. scores will not necessarily be the 
same and we cannot speak simply of the reliability of a test. 

The reader may immediately raise the question as to which reliability
coefficient should be given. As a matter of fact there is no general answer to this
question because some workers may wish to use mental age scores and others
I.Q. scores. To ensure general satisfaction, all the coefficients should be given
in order that a worker may use the value appropriate to his particular problem.

Some examples of the kind of results which may be obtained are given below.
The comparable forms estimates of reliability are shown in Table XXIX for
each of four grades and the total for all grades, and the correlation between
mental age, I.Q. scores and chronological age for Grade 5 only in Table XXX.
Although the samples are small, these results are typical.


TABLE XXIX

COMPARISON OF RELIABILITY COEFFICIENTS CALCULATED FROM MENTAL AGE AND I.Q.
SCORES (BY GRADES)

                            Reliability Coefficients
Grade     Number of
          Pupils            Mental Age        I.Q.

7            38               0.865           0.958
6            42               0.907           0.974
5            40               0.830           0.937
4            38               0.888           0.961
Total       158               0.934           0.958

TABLE XXX

RELATIONSHIP BETWEEN MENTAL AGE, I.Q. SCORES AND
CHRONOLOGICAL AGE (GRADE 5 ONLY, 40 CASES)

                   Correlation with Chron. Age
Form used
                   Mental Age        I.Q.

Form A               +0.036         -0.766
Form B               -0.056         -0.815


If we correct the Grade 5 reliability coefficients for chronological age, we have
a value of 0.834 for mental age and 0.839 for I.Q. scores.

When we sample from a population composed of several grades, the difference
between the mental age and I.Q. reliability coefficients tends to disappear. In
this case, of course, we increase greatly the variability of both mental and
chronological age but the variability of the I.Q. scores is relatively unchanged.
The values shown in the last row of Table XXIX illustrate this point; the
reliability coefficient based on I.Q. scores is still higher but the difference is not
as great as in the other cases.
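
The correction of the Grade 5 coefficients for chronological age can be retraced with the usual first-order partial correlation formula, using the Grade 5 reliability coefficients of Table XXIX and the correlations with chronological age of Table XXX. The sketch below assumes that this standard formula is what is meant by "the usual partial correlation technique"; the function name is illustrative.

```python
import math

def partial_corr(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b with c held constant."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1.0 - r_ac ** 2) * (1.0 - r_bc ** 2))

# Grade 5: correlation between Forms A and B, and of each form with chronological age.
mental_age = partial_corr(0.830, 0.036, -0.056)
iq = partial_corr(0.937, -0.766, -0.815)
print(f"mental age: {mental_age:.3f}   I.Q.: {iq:.3f}")
```

Both corrected coefficients come out close to the values of 0.834 and 0.839 quoted in the text, which shows how removing chronological age brings the two reliability estimates together.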
APPENDIX D


NOTE ON THE RELATIONSHIP BETWEEN RELIABILITY AND SAMPLING 
WITHOUT REPLACEMENT 


In the succeeding discussion certain reliability formulae are developed from
probability considerations used in the theory of sampling without replacement.
Consider, firstly, an urn containing $N$ balls of which $Np$ are white and $Nq$
are black. From this urn we may draw a ball $n$ times without replacement, and
the variance of blacks or of whites in repeated sampling is given by the formula

$$\sigma^2 = npq - npq\,\frac{n-1}{N-1} \tag{122}$$

where $p$ = proportion of white balls in population sampled;
$q$ = proportion of black balls in population sampled;
$n$ = number drawn;
$N$ = number of balls in population.


We assume in the above formula that the balls are unbiased. 
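
The variance formula for sampling without replacement can be verified by direct enumeration. The following sketch, which uses exact rational arithmetic and illustrative function names, computes the variance of the number of white balls from the hypergeometric probabilities and compares it with $npq - npq\,(n-1)/(N-1)$, which is the same quantity as $npq\,(N-n)/(N-1)$.

```python
from fractions import Fraction
from math import comb

def exact_variance(N, whites, n):
    """Variance of the number of white balls in n draws without replacement."""
    total = comb(N, n)
    lo, hi = max(0, n - (N - whites)), min(n, whites)
    probs = {k: Fraction(comb(whites, k) * comb(N - whites, n - k), total)
             for k in range(lo, hi + 1)}
    mean = sum(k * p for k, p in probs.items())
    return sum((k - mean) ** 2 * p for k, p in probs.items())

def formula_variance(N, whites, n):
    """npq - npq*(n-1)/(N-1), with p the proportion of white balls."""
    p = Fraction(whites, N)
    q = 1 - p
    return n * p * q - n * p * q * Fraction(n - 1, N - 1)

print(exact_variance(20, 10, 6), formula_variance(20, 10, 6))
```

With half the balls white ($p = q = \frac{1}{2}$) the formula becomes $n/4 - n(n-1)/\{4(N-1)\}$, which is the form used for a test split into odd and even items.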

Now an intelligence test constructed of $N$ items, $N/2$ odd and $N/2$ even,
may be likened to an urn containing $N$ balls, one half of which are white and
the other half black, the variance of white or black balls in repeated sampling
without replacement being given by the formula

$$\sigma^2 = \frac{n}{4} - \frac{n(n-1)}{4(N-1)} \tag{123}$$


In splitting a test by calculating scores on the odd and even items we may
argue that, if the test is split without bias, $p = q = \frac{1}{2}$. Thus if a large number of
persons make a score $X$ on a test $Z$ of $N$ items, the variance of scores on the odd
items, which here is presumed equal to the variance of scores on the even items,
is given by

$$\sigma_{o \cdot X}^2 = \frac{X}{4} - \frac{X(X-1)}{4(N-1)} \tag{124}$$


Since all the balls in the urn to which we likened our test were presumed to 
exist without bias and to be equally probable, it is clear that formula (124) 
makes the assumption that all the items of our test must be of the same diffi- 
culty. 

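Since

$$\sigma_{o \cdot X}^2 = \overline{o^2}_X - \left(\frac{X}{2}\right)^2 \tag{125}$$

where $\overline{o^2}_X$ denotes the mean square of the odd-item scores of persons with total
score $X$, whose mean odd-item score is $X/2$, we may from formula (124) write

$$\overline{o^2}_X = \frac{X}{4} - \frac{X(X-1)}{4(N-1)} + \frac{X^2}{4} \tag{126}$$
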
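Summing over all values of $X$ and averaging we obtain

$$\overline{o^2} = \frac{M}{4} - \frac{\overline{X^2}}{4(N-1)} + \frac{M}{4(N-1)} + \frac{\overline{X^2}}{4} \tag{127}$$

where $M$ is the mean of $X$. Hence

$$\sigma_o^2 = \frac{M}{4} - \frac{\overline{X^2}}{4(N-1)} + \frac{M}{4(N-1)} + \frac{\overline{X^2}}{4} - \left(\frac{M}{2}\right)^2 \tag{128}$$

which, since $\sigma_t^2 = \overline{X^2} - M^2$, reduces to

$$\sigma_o^2 = \frac{1}{4(N-1)}\left[M(N-M) + (N-2)\,\sigma_t^2\right] \tag{129}$$
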

This formula gives the variance of scores on either the odd or the even items.
But the reliability of a test on the assumption that the variance of scores on
the odd items is equal to the variance of scores on the even items is given by

$$r_{tt} = 2 - \frac{4\sigma_o^2}{\sigma_t^2} \tag{130}$$


Substituting equation (129) in equation (130) we obtain

$$r_{tt} = \frac{N\sigma_t^2 - M(N-M)}{(N-1)\,\sigma_t^2} \tag{131}$$

This formula is identical with the Kuder-Richardson formula (21). It is
probable that the Kuder-Richardson formula (20) also can be derived by a
method similar to that described above by taking into consideration the fact that
the items may be biased.
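
The algebra leading to the Kuder-Richardson formula (21) can be checked numerically. The sketch below codes the half-test variance of equation (129), the reliability expression of equation (130), and the Kuder-Richardson formula (21) in its usual textbook form, and confirms that they agree. The values $N = 50$, $M = 30$ and a total-score variance of 64 are hypothetical and chosen only for illustration, as are the function names.

```python
def odd_half_variance(N, M, var_total):
    """Equation (129): variance of scores on the odd (or the even) items."""
    return (M * (N - M) + (N - 2) * var_total) / (4.0 * (N - 1))

def reliability_from_halves(var_half, var_total):
    """Equation (130): reliability when both halves have the same variance."""
    return 2.0 - 4.0 * var_half / var_total

def kr21(N, M, var_total):
    """Kuder-Richardson formula (21) in its usual form."""
    return (N / (N - 1.0)) * (1.0 - M * (N - M) / (N * var_total))

# Hypothetical test: 50 items, mean score 30, total-score variance 64.
N, M, var_total = 50, 30.0, 64.0
via_halves = reliability_from_halves(odd_half_variance(N, M, var_total), var_total)
print(round(via_halves, 6), round(kr21(N, M, var_total), 6))
```

The two printed values coincide, as the derivation requires.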